METHOD AND APPARATUS FOR IMAGING, IMAGE PROCESSING AND DATA COMPRESSION MERGE/PURGE TECHNIQUES FOR DOCUMENT IMAGE DATABASES

This application is a continuation-in-part of copending U.S. patent application Ser. No. 08/259,527, filed Jun. 14, 1994, pending, which is a continuation-in-part of U.S. patent application Ser. No. 08/224,273, filed Apr. 7, 1994, now abandoned, and a continuation-in-part of U.S. patent application Ser. No. 08/213,795, filed Mar. 15, 1994, now U.S. Pat. No. 5,479,486.

FIELD OF THE INVENTION

The present invention relates to the field of automated image processing, image compression and pattern recognition, as well as document-image storage and retrieval, and more particularly to financial instrument processing to provide efficient storage and retrieval of check image information and detection of fraudulent and authentic instruments, as well as to management of databases containing document-image information, and provision of merge/purge techniques to eliminate redundancies and errors.

BACKGROUND OF THE INVENTION

Financial Instruments

In commercial and savings banking practice, monetary transfers often involve documents that include standard, preprinted information (backgrounds, logos, icons, repetitive patterns, fields and the like) as well as post-printed information (handwritten entries, names, addresses and the like) that render the item negotiable or representative of a legal, binding contract. These items are documents that comprise forms with added information, and include, e.g., checks, deposit and withdrawal slips, coupons, travelers' cheques, letters of credit, monetary instruments, food stamps, insurance forms, title documents, official government forms, tax forms, medical forms, real estate forms, inventory forms, brochures, information forms, application forms, questionnaire forms, laboratory data forms and the like. It is generally desirable to automatically extract relevant information from a form in order to assist in the processing of that information.

A check is a negotiable instrument, which is signed by the maker or drawer, indicates a sum certain of money (or other specified value), a date to pay, and a direction to a bank or financial institution to pay to the order of the payee or to bearer on demand. The check thus generally has certain information or indications preprinted on it, information which is added to customize the check for the drawer and the payor bank, and information unique to each check written. In order for the bank to pay on the item, a check is generally first endorsed on the reverse side upon tender. Processing institutions in the international banking collection and settlement process will typically each stamp the check with identifying information, and also provide status relating to dishonor or abnormal circumstances. In normal banking procedures, the paper check passes from the maker or drawer to the payee, who then deposits the check with the payee bank. The paper check is then cleared, for example, through a central clearinghouse of one or more banks, and is sent to the payor bank. While the funds themselves typically do not, in a physical sense, pass hands, but rather are indicated electronically on daily balance sheets, the document itself, under strictures of law and custom which originated hundreds of years ago, actually passes and is returned to the maker. Electronic funds transfers are also available, but these transfers do not necessarily require a written authorization granted to the recipient of the funds, and thus do not pose the same paper handling problems.

Accordingly, in the area of check clearance, and, as well, with respect to the other instruments, items and documents, the physical document and its possession and transfer are important, since the funds are withdrawn from the drawer or maker's account while the check is returned to the maker or drawer. Thus, the paper check must generally be physically transferred through the banking system.

The check originates from a printing house and contains customary preprinted information that is identical from one check of a given style to the next. The check becomes a legally operative document upon the inclusion of handwritten or post-printed information, which also renders the document unique and provides for its special place in the collection process.

The originator of the check transmits the check to a point of collection (e.g., a lock-box operation that handles bulk mailings, or to the payee directly). The relevant information is verified by the payee and the check is endorsed and delivered to the bank for deposit to an account. At this point in the process, electronic notation of the transaction is performed, while the paper media is physically collected and sent through a clearing system. The paper check is then sorted and prepared for delivery to the originator's bank. The electronic information is used to net out and transfer funds between banks by the clearinghouse system. The paper check goes to the originator bank for sorting, debiting customer accounts (originator), microfilming, envelope stuffing and final delivery back by mail to the originator. Errors can and do occur at every stage of the process. An error can result in a liability which equals or exceeds the value of the transaction, as well as subjects the maker of the error to regulatory sanctions. Thus, only a very low error rate is tolerated.

In today's world, it is sometimes inconceivable that the cash itself never passes hands but can be electronically transferred or exchanged, while the document underlying the transfer must move from one bank to the next and cannot be electronically transferred. While this may not be the case for electronic funds transfers (which are controlled by special legislation and do not typically involve the use of checks), clearly check transfer processes are antiquated and cannot utilize the wealth of electronic data transfer mechanisms unless the integrity of the paper itself is maintained. Thus, any system which is employed to improve the efficiency of the check handling and clearing process should maintain the integrity of the information to legal standards, and also meet customers' demands for reliability, efficiency and acceptability.

In 1988 the Board of Governors of the Federal Reserve System stated that "the benefits of a nationwide electronic presentment system would not be sufficient to outweigh the costs of a nationwide system." Furthermore, the Board recommended that the focus of such a system should be on image processing to expedite moving the payment through the system. It was made clear that by the use of image interchange, the operational and transportational expenses would be greatly reduced. Likewise, through truncation processing, the amount of stored and transmitted information can be minimized. Thus, the benefits of a check truncation system are clearly taught in the art. However, past analyses have indicated that such systems are expensive. While the use of truncation processing can minimize the information to be transmitted and stored, past systems have generated a relatively large file for each document so processed, so that this burden is not considered trivial.

It is known in the art of digital data storage and compression to compress data by compiling code libraries of information in a digital data source file, with a code library derived from the data to be compressed or with a code library having a content based on a predicted likely information content of the source file, resulting in a compressed file if the source file is represented as a series of pointers to portions of the code library, when the code library contains enough sequences in common with the source file. Thus, the series of pointers to the code library can be represented by a smaller information content signal than the data source file itself. Further, it is known that such code libraries may be adaptive and updated to include information from a digital data source file, which may be repeated elsewhere, thereby effecting a lower data storage requirement. Code libraries may also be purged of information which does not appear in files to be compressed. A limited size code library offers two advantages: first, it limits the search time to match a sequence in the source file with a sequence in the code library; and second, it limits the size of an individual pointer and therefore allows the compiled series of pointers to have an optimum length.

A match between a stored template and a scanned image is rarely exact, e.g. because of noise introduced when scanning or skew of the scanned image that may have been introduced when handling the paper document. This is so even when the template is only a portion of the input image or even when the match is based on features rather than pixel values. Consequently, a frequent method of matching a template is done by using a distance measure, D(m,n), between the template and the image at all points in the image field. The input image is deemed to be a match whenever the distance is less than a preestablished threshold (X). The distance function D(m,n) is computed at a variable starting point in the input image I(m,n). Because of the skew or noise, the search of the input image may be at some localized area for a matching template, T(j,k). The template is searched over a search area of the input image, and the search can be constrained over some region of I(m,n) of the image where 0 ≤ m ≤ M and 0 ≤ n ≤ N, for example. The pixels are then index points of the image as a range over a matrix. By way of example, the index can start at the lower left most pixel of an image as the position (0,0) in a typical coordinate system. One common distance function used is where the difference is defined as:

$$D(m,n) = \sum_{j} \sum_{k} \left[ I(j+m,\, k+n) - T(j,k) \right]^{2}$$

A template match exists at coordinate location (m,n) if:

$$D(m,n) < X$$

Since many templates exist in the database, B(I) is denoted as the closest matching template for a database of templates, F = {x | x is a template}, and is defined as B(I) = x if D(m,n) is a minimum. The matching of the templates is complicated by a number of problems, e.g. shifts, rotational differences or scale differences, when pixel-by-pixel processing is necessary. It is therefore often important to spatially register the two images. Techniques are known in the art that deal with image registration. Such techniques improve upon the template matching process.

A number of image matching techniques are known and used in the art. Generally, these fall into three categories:

1) Correlation approach: a traditional approach encompassing signal processing and statistical decision theory concepts.

2) Feature approach: whereby pixel-by-pixel intensity variations are ignored in favor of selected measurable features or attributes and relations of an image, e.g. texture or color.

3) Relational matching: where detailed correspondences between the images include geometric relationships between selected components. This provides for modeling of an image by landmarks to make further processing more efficient by being able to prioritize the landmarks depending upon their particular semantic significance. See Pratt, W. K., Digital Image Processing, and Fischler, M. & Firschein, O., Intelligence: The Eye, the Brain and the Computer, for more information on image matching techniques.

If features are used as a means of matching, various codebooks can be created, each with its own set of features. Thus, the algorithm can choose a codebook depending upon the features of the document and therefore dramatically reduce the search time.

Model-based data compression is also a known concept. In a model-based compression system, certain characteristics of the data to be compressed are presumed or predicted. In a model-based system, ordinarily, an expert studies the characteristics of the type of data, and designs a codebook optimized for the expected data. The information content of the data may be substantially reduced by taking into consideration characteristics common to a significant [...] data signal to be compressed and the model, or a selected model encompassed by the system, which [...] the relevant data to be further processed. Of course, if a model completely describes the source data, then the compressed data consists of merely an identification of the model. [...] deviations from the model which are insignificant, without substantially increasing the amount of information which is necessary to describe the source data. Therefore a model-based system may include one or more models which characterize prototypic data, and an unknown signal is then matched to a selected model, and processed to eliminate information included in the model.

It is further known that a large number of images or compressed images may be stored in a storage device. These [...] for a pattern recognition [...] for matching an unknown pattern against the images in the database. The storage medium may be RAM, ROM, [...] flash memory, magnetic storage medium, [...] storage medium, an optical storage medium and other known systems. The images stored in these databases may provide a very large number of templates or models against which an image or [...] to be [...] used to select a best match.

Automated handwriting extraction from documents and recognition thereof is also known. Handwriting recognition may be used for computer information input. Known optical character recognition systems may be used to read and interpret handwriting. Systems are also available to extract handwritten information from electronic images of forms.

Database Management

Merging and coalescing multiple sources of information into one unified database requires more than structurally
integrating diverse database schema and access methods. In applications where the data is corrupted, i.e. is incorrect, ambiguous, having alternate forms or has changed over time, the problem of integrating multiple databases is particularly challenging. This is known as the merge/purge problem. Merging information requires so-called semantic integration, which requires a means for identifying equivalent or similar data from diverse sources. The merging process must then determine whether two pieces of information or records are of sufficient similarity, and that they represent some aspect of the same domain entity, by means of sophisticated inference techniques and knowledge of the domain.

A very large database is one in which it is unfeasible to compare each record with every other record in the database, for a given operation. Therefore, a simplifying presumption is necessary in order to ensure the integrity of the data records, such as when a batch of new records is added to the database. In general, this presumption is that a predetermined subset of the database records may be selected in which a cross comparison of the records within the subset will be effective to ensure the integrity of the database to within a reasonable limit.

In the field of mailing list verification, the database integrity is generally ensured by first sorting the database according to a criterion, then selecting a window of consecutive sorted records, and then comparing the records within the window with each other. The purpose is to eliminate duplicate records, so that within the window, records which appear to correspond are identified as such, and an algorithm is executed to select a single record as being accurate and to eliminate any other corresponding records. This known method, however, will not eliminate records which are corresponding and yet are not present within the window. Further, the comparison algorithm may not perfectly identify and eliminate duplicate records. This problem will exist with respect to financial documents, like checks and other items, as well.

Known very large database systems may be maintained and processed on mainframe-class computers, which are maintained by service bureaus or data processing departments. Because of the size of these databases, among other reasons, processing is generally not networked, e.g. the data storage subsystem is linked directly to the central processor on which it is processed and directly output.

Other database processing methods are known, however these have not been applied to very large databases. This is not a matter of merely database size, but rather magnitude. In general, the reason for ensuring the integrity of a mailing list database is a matter of economics, e.g. the cost of allowing errors in the database as compared to the cost of correcting or preventing errors. Of course, when these databases are employed for other applications, the "cost" of errors may be both economic and non-economic. Often, databases are maintained for many purposes, including mailing list, and thus the costs may be indeterminate or incalculable.

The semantic integration problem, see ACM SIGMOD Record (December 1991), and the related so-called instance-identification problem, see Y. R. Wang and S. E. Madnick, "The inter-database instance identification problem in integrating autonomous systems", Proceedings of the Sixth International Conference on Data Engineering (February 1989), as applied to very large databases are ubiquitous in modern commercial and military organizations. As stated above, these problems are typically solved by using mainframe computing solutions. Further, since these organizations have previously implemented mainframe class solutions, they typically have already made a substantial investment in hardware and software, and therefore will generally define the problem such that it will optimally be addressed with the existing database infrastructure.

Routinely, large quantities of information, which may in some instances exceed one billion database records, are acquired and merged or added into a single database structure, often an existing database. Some of the new data or information to be merged from diverse sources or various organizations might, upon analysis, be found to contain irrelevant or erroneous information or be redundant with preexisting data. This irrelevant, erroneous or redundant information is purged from the combined database.

Once the data is merged, other inferences may be applied to the newly acquired information; e.g. new information may be gleaned from the data set. The ability to fully analyze the data is expected to be of growing importance with the coming age of very large network computing architectures.

The merge/purge problem is closely related to a multi-way join over a plurality of large database relations. The simplest known method of implementing database joins is by computing the Cartesian product, a quadratic time process, and selecting the relevant tuples. It is also known to optimize this process of completing the join processing by sort/merge and hash partitioning. These strategies, however, assume a total ordering over the domain of the join attributes or a "near perfect" hash function that provides the means of inspecting small partitions (windows) of tuples when computing the join. However, in practice, where data corruption is the norm, it is unlikely that there will be a total ordering of the data set, nor a perfect hash distribution. Known implemented methods nevertheless rely on these presumptions. Therefore, to the extent these presumptions are violated, the join process will be defective.

The fundamental problem is that the data supplied by the various sources typically includes identifiers or string data that are either erroneous or accurate but different in their expression from another existing record. The "equality" of two records over the domain of the common join attribute is not specified as a "simple" arithmetic predicate, but rather by a set of equational axioms that define equivalence, thus applying an equational theory. See S. Tsur, "PODS invited talk: Deductive databases in action", Proc. of the 1991 ACM-PODS: Symposium on the Principles of Database Systems (1991); M. C. Harrison and N. Rubin, "Another generalization of resolution", Journal of the ACM, 25(3) (July 1978). The process of determining whether two database records provide information about the same entity can be highly complex, especially if the equational theory is intractable. Therefore, significant pressures exist to minimize the complexity of the equational theory applied to the dataset, while effectively ensuring the integrity of the database in the presence of syntactical or structural irregularities.

The use of declarative rule programs implementing the equational theory to identify matching records is best implemented efficiently over a small partition of the data set. In the event of the application of declarative rule programs to large databases, the database must first be partitioned into meaningful parts or clusters, such that "matching" records are assigned to the same cluster.

Ordinarily the data is sorted to bring the corresponding or matching records close together. The data may also be partitioned into meaningful clusters, and individual matching records on each individual cluster are brought close
together by sorting. This basic approach alone cannot, however, guarantee the "mergeable" records will fall in a close neighborhood in the sorted list.

SUMMARY AND OBJECTS OF THE INVENTION

The foregoing and other objects of the invention are achieved herein by a system and method for processing images in the form of, e.g., a financial or standardized type of document, comprising the steps of scanning the image to create a first digital image of the document; comparing the first digital image against a codebook of stored digital images, or features of an image, or information relating to objects in an image; matching the first digital image with one of the stored digital images, or features of an image, or information relating to objects in an image; producing an index code identifying the stored digital image as having matched the first digital image; subtracting the stored digital image from the first digital image to produce a second digital image; and storing the second digital image with the index code.

Additionally, the invention includes the use of a rule-based system for merging databases containing such financial document images, which declaratively uses an equational theory for ensuring database integrity. Further, according to the present invention, very large databases are accommodated, databases which are so large that parallel and distributed computing systems are preferred for achieving an acceptable performance in a reasonable amount of time with acceptable cost. The present invention preferably employs the so-called sorted neighborhood method to solve the merge/purge problem. Alternatively, a so-called clustering method may also be employed.

More particularly, one aspect of the invention comprises optically scanning personal or corporate checks to produce a digital signal and converting the data, through a model-based or other compression algorithm, to produce a significantly smaller data file, thus reducing the amount of memory needed from, e.g., about one megabit per check to only a few hundred bytes per check.

A standard personal check measures approximately 36 square inches, while corporate checks can vary, for example, from about 20 to 60 square inches. A simple approach to scanning with high resolution (600x600 dpi) can therefore lead to an image requiring a minimum of one megabit of information per check. The present system, however, employs a model-based image analysis and compression system, which separates and extracts a variable foreground portion of the check from a form or background portion of the check, and matches the form portion with one of a plurality of stored templates to allow substantial compression by representing the background as merely a simple identifying code. The system optionally determines an error between the matched stored template, determines whether the error is not significant, and ignores the error if it is insignificant and subjects the check image to further processing if the error is significant. The foreground image, separated from the background image and compensated for error, is further subjected to an optimal compression adapted to the particular type of information, e.g. handwritten or printed.

Alternatively, the invention also comprises a method that does not involve a first determination of form portion and variable portion (or background and foreground) but rather seeks the existence of a match with the template, using an algorithm which is tolerant to the existence of a variable portion. Either way, the compressed information according to the present invention will have an information content significantly smaller than the original, complete scanned image, which can be, e.g., about a few hundred bytes, or less. The standard check image compression systems heretofore known, however, produce files having an information content of at least about 40 kilobytes or more at a lower resolution than 600 dpi.

By processing the checks early in the chain of collection events, while extracting all of the necessary information from the check, the present system eliminates the requirement of physically sending the check from place to place [...] conversion, reducing the possibility of errors at the various subsequent points. Therefore, once a check is properly scanned and the integrity of the data ensured, the paper original may be destroyed or, at a minimum, may be stored remotely from the location of the [...] settlement process. In fact, under the present invention, the stored compressed digital image can, and should, replace the paper check, and therefore the paper check can be eliminated entirely, thereby preventing fraud through duplicate processing [...] of paper.

The present system therefore allows the "truncation" of a paper check into an electronic image form. This allows the transfer of reliable and secure information between the various parties without need for the physical transfer of paper. Security of the digital data may be ensured by various encryption methods, e.g. public key/private key systems, digital signature standard, digital encryption standard and other known secure encryption systems. The electronic message may also be time stamped. The paper check may be truncated at any of a number of points in the processing chain. The check may be truncated by the payee, through use of a device which, in a secure reliable manner, scans the check image and destroys or permanently defaces (or stores remotely) the original. In a lock box operation, where the payee has an agent to collect payments and process them, the operator of the lock box operation may also truncate the checks, and transmit the information to the appropriate financial institution, as well as the contracted party or any other interested party. The check may also be truncated at the payee's bank, where the payee presents the check for deposit. In this case, the security is less important, although the integrity of the system depends on a fail-safe system. The central clearinghouse may truncate the paper check, either before or after processing. This clearinghouse may also include a special check truncation unit, as a function thereof, and the truncation need not be a part of the existing clearinghouse infrastructure. The payor's bank may also truncate the check, although some of the advantages provided by aspects of the present invention may not be fully achieved. Finally, the payor may truncate the check, saving substantially the cost of mailing back the checks to the originator.

The memory reduction facilitated by the present invention is done by separating the background of the check, i.e. the check style information, from the personalized, foreground information, i.e. the handwritten, post-printed or added portion. In some cases the background information may include preprinted signatures or other identifying information, e.g. corporate or private names and addresses. Once the background information is determined or defined, it is maintained in a library and only an index code associated with that background need be maintained with the foreground, personalized information to represent a check image. The background can then be deleted or eliminated from the stored check information (except for the identifying code), thus reducing the amount of memory required. In
some cases the identifying code may be represented throughput and because of the high computational expense
implicitly, therefore eliminating it entirely from the stored of the image pattern matching operation. In order to keep the 
information. Thus, the invention includes, as a feature, the size of the database to reasonable proportions, a compressed 
creation and maintenance of a codebook library of scanned version of the image may be maintained. When the com-
check information, in a suitable storage form, e.g. actual pressed image is retrieved, some preprocessing may need to
image or compressed image data of various resolutions, that be done to decompress the image before attempting to 
can be used to regenerate the actual image data, through the compare or cross correlate it to the check image.
use of an algorithm executed by a computer or a series of The system according to the present invention preferably 
mathematical equations that can compare the features and employs to full advantage any clues provided on the check 
relationships (e.g. geometric) between the codebook and the 10 as to the identity of the form, such as the preprinted codes 
actual regenerated image. Where the code has been elimi- referred to above, an identification of the manufacturer of 
nated and is only implicit within the rest of the stored image, the check, information relating to the manufacturing or 
the computer algorithm can regenerate the actual image printing process employed in making the check, and other 
through this implicit knowledge. By way of example, if the information that may limit the available choices of check 
issuing bank only uses one background for all their checks, 15 type, e.g. account number or customer identification that 
by the algorithm knowing the issuing bank, the algorithm may be associated with check types that is typically used by 
implicitly knows the background without need of a code. that customer as may be learned by a computer system. 
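
The codebook arrangement described in this passage (a background template referenced by an index code, with the code omitted when the issuing bank alone identifies the background) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Codebook` and `StoredCheck` classes, the bank and template identifiers, and the representation of an image as rows of integer pixel values are hypothetical and are not taken from the specification. Plain pixel subtraction stands in for whatever foreground extraction and compression would actually be used.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Image = List[List[int]]  # hypothetical representation: rows of grayscale pixel values


@dataclass
class StoredCheck:
    bank_id: str
    template_code: Optional[str]  # None means the code is implicit in bank_id
    foreground: Image             # scanned image minus the matched background


def _combine(a: Image, b: Image, sign: int) -> Image:
    """Pixel-wise a + sign * b for two images of equal size."""
    return [[pa + sign * pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


class Codebook:
    """Hypothetical library of background templates keyed by index code."""

    def __init__(self) -> None:
        self.templates: Dict[str, Image] = {}
        self.codes_by_bank: Dict[str, List[str]] = {}

    def register(self, bank_id: str, code: str, background: Image) -> None:
        self.templates[code] = background
        self.codes_by_bank.setdefault(bank_id, []).append(code)

    def compress(self, bank_id: str, code: str, scanned: Image) -> StoredCheck:
        # Keep only the foreground: the scan with the background subtracted.
        foreground = _combine(scanned, self.templates[code], -1)
        # Drop the explicit code when the bank alone identifies the background.
        implicit = self.codes_by_bank.get(bank_id) == [code]
        return StoredCheck(bank_id, None if implicit else code, foreground)

    def regenerate(self, stored: StoredCheck) -> Image:
        code = stored.template_code
        if code is None:
            candidates = self.codes_by_bank.get(stored.bank_id, [])
            if len(candidates) != 1:
                raise LookupError("implicit code requires exactly one template per bank")
            code = candidates[0]
        # Rebuild the full image from the library background plus the foreground.
        return _combine(stored.foreground, self.templates[code], +1)
```

In this sketch an implicit code stays valid only while the bank keeps a single background; registering a second template for that bank later would make earlier records ambiguous, so a practical system would need some versioning rule that the sketch does not attempt.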

The present system also includes a system for positively identifying the background image of a check through a code contained thereon, which may be a zebra code, bar code, two-dimensional code, numeric or alphanumeric code, invisible optically readable code, or magnetic readable code. This code allows the easy access of a corresponding image of a blank check form from a storage medium, which allows easier identification of the added information on the foreground through image subtraction, and provides positive identification of the check style. While it is not necessary to provide the image stored in a database, providing such an image will allow the verification of the image code and allow the identification of forgeries and frauds. The preferred storage medium, where a search of the entire database is required to determine the existence of a match, is CD-ROM, which allows the cost effective storage and updating of a large number of check forms.

When a background subtraction of a check form is accurate, the remaining information on the check may be compressed to about, e.g. 1,000-2,000 bytes of information or possibly smaller depending upon the amount of foreground information. Further, if a model-based algorithm is employed on the added information, such as a spline or wavelet representation of the handwritten information or an optimized compression of the post-printed information, only a small amount of information, e.g. the spline control points, needs to be stored, and the entire check may be represented very compactly, as, for example, a few hundred bytes of information. See Three Dimensional Graphics, Chapter 21, "Curves and Surfaces", pp. 309-330.

When identifying coding of the checks is not available, a pattern recognition system can be employed according to the present invention to match the background image of a check being processed with a database of available images of check backgrounds. In such an instance, it is preferred that the image being analyzed be preprocessed, i.e. subjected to an early stage analysis, to determine certain characteristics thereof. The images in the database are indexed according to similar or identical characteristics or criteria. Therefore, early in the processing of a check, it is possible to define a subset of the database which possesses corresponding characteristics to the check. This preliminary matching procedure may be conducted for a number of different characteristics, and may be conducted in parallel by a number of processing units. Thus, a small subset of the overall database of templates, having characteristics closely matching the check being processed, may be defined for further processing. It is preferred that only a small number of images be retrieved from the database for direct comparison or cross correlation with the check image, because of the need to maintain high

Further, the processing is preferably time limited, so that after a predetermined period of time, the matching processing ceases and the check is represented as a compressed image based on the image data and the processing performed in the matching process. If the processing is prematurely terminated, that check may optionally be later processed by an exception handling system. Further, in cases of an incomplete match, a check may be described by its relation to other templates or portions of templates in the database, which may be described in a smaller number of bytes than a non-model based de novo description. In addition, a check image may also be described in relation to combinations of transforms applied to defined patterns or portions of other available images in the database.

Many checks have a repetitive texture or pattern in the background region. These repetitive background patterns lend themselves to efficient compression and/or analysis using Fourier coefficients, wavelets, fractal (iterated function system) transforms, and/or spatial pattern analysis. Such processing is expected to greatly reduce the amount of relevant data, while retaining information necessary for reconstructing the check image, even if a full background template matching function is not completed on the check.

The present system for scanning checks may be advantageously employed with a fraud detection system. The scanner preferably has a high resolution, e.g. between 300-800 dpi, to produce resultant images of very high quality. Certain new check styles include finely printed information in, e.g. 1 point type, which cannot be scanned using normal scanning equipment, for the purpose of preventing electronic copies or forgeries. The present system may scan the entire image at extremely high resolution, in order to properly detect this information. After data compression, the results of such a high resolution scanning system will be small. Subtle variations and "flaws" may thus be detected in the check background. Therefore, if such flaws are intentionally placed by the manufacturer, the image contained on the database could include an identification of these "flaws", which would be searched for on the checks being scanned. Thus, a match would also determine the authenticity of the check form, thereby thwarting certain types of frauds. Likewise, any checks which are scanned are analyzed for the presence of artifacts not present on authentic documents. These may be present due to falsification of the document. Therefore, the present system may assist banks in detecting and preventing fraudulent transfers. It is noted that certain kinds of flaws may only reasonably be detected through an electronic analysis, and therefore the present system allows new, enhanced methods for ensuring the integrity of the system.
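
A minimal sketch of the search for intentionally placed "flaws", assuming the template record lists the pixel positions and gray values of those flaws and that the scan has already been registered to the template. The names `missing_flaws` and `expected_flaws`, the image representation and the tolerance value are illustrative assumptions only.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]   # (row, column)
Image = List[List[int]]   # grayscale values, 0-255

def missing_flaws(scan: Image, expected_flaws: Dict[Pixel, int],
                  tolerance: int = 20) -> List[Pixel]:
    """Return the flaw positions whose scanned value differs from the
    value recorded in the template by more than `tolerance`."""
    missing = []
    for (row, col), expected in expected_flaws.items():
        if abs(scan[row][col] - expected) > tolerance:
            missing.append((row, col))
    return missing

# The template records dark marks placed by the manufacturer at two positions.
expected_flaws = {(0, 1): 30, (2, 2): 40}
genuine = [[200, 35, 200], [200, 200, 200], [200, 200, 45]]
copy = [[200, 35, 200], [200, 200, 200], [200, 200, 200]]

print(missing_flaws(genuine, expected_flaws))  # []
print(missing_flaws(copy, expected_flaws))     # [(2, 2)]
```

A non-empty result would be one reason to route the check to exception handling rather than accept the form as authentic.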



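
The indexed preliminary matching and the time limit on matching described above could be combined roughly as follows. This is a sketch under stated assumptions: each template is filed under one coarse characteristic used as an index key, `score` is a hypothetical similarity function supplied by the caller, and the time budget is checked between comparisons.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

def match_with_budget(check_key: str, check_image: object,
                      index: Dict[str, List[Tuple[str, object]]],
                      score: Callable[[object, object], float],
                      threshold: float, budget_seconds: float
                      ) -> Tuple[Optional[str], bool]:
    """Compare the check only against templates filed under its index key.

    Returns (template_id or None, timed_out). If the budget expires, the best
    candidate so far is returned only when it already meets the threshold;
    otherwise the caller can pass the check to exception handling.
    """
    deadline = time.monotonic() + budget_seconds
    best_id, best_score, timed_out = None, float("-inf"), False
    for template_id, template_image in index.get(check_key, []):
        if time.monotonic() > deadline:
            timed_out = True
            break
        s = score(check_image, template_image)
        if s > best_score:
            best_id, best_score = template_id, s
    if best_score < threshold:
        best_id = None
    return best_id, timed_out

# Toy data: images are tuples, similarity is the fraction of equal elements.
index = {
    "blue": [("t1", (1, 2, 3, 4)), ("t2", (1, 2, 9, 9))],
    "green": [("t3", (0, 0, 0, 0))],
}
overlap = lambda a, b: sum(x == y for x, y in zip(a, b)) / len(a)
found = match_with_budget("blue", (1, 2, 3, 4), index, overlap, 0.9, 1.0)
print(found)  # ('t1', False)
```

Restricting the loop to one index bucket stands in for the "small subset" of templates; a real system would presumably combine several characteristics and run the comparisons in parallel.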

The present system preferably analyzes the foreground information on a check in order to ensure the availability and consistency of necessary information on the check prior to passing through the check clearing process. Thus, missing information may easily be detected. Further, inconsistency between the courtesy figure (e.g. dollar amount in decimal) and written amount on the check may be detected.

The prior art teaches the sorting and searching of databases. See Gotlieb & Gotlieb, Data Types and Structures, Chapter 4, 97-155, Prentice-Hall (1978).

Prior art teaches parallel processing and parallel operations. See U.S. Pat. No. 4,860,201, Binary Tree Processor and U.S. Pat. No. 4,843,540, Parallel Processing System, both to Salvatore J. Stolfo.

fields, comprising the steps of computing a first key for each record in each table by extracting at least a portion of a first field; sorting the records in each data list using the first key; comparing a predetermined number of sequential records sorted according to the first key to each other to determine if they match; storing identifiers for any matching records; computing a second key for each record in the table by extracting at least a portion of a second field; sorting the records in each data list using the second key; comparing a predetermined number of sequential records sorted according to the second key to each other to determine if they match; storing identifiers for any matching records; and subjecting the union of said stored identifiers to transitive closure.
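
The multi-pass procedure recited above (sort on a key, compare records inside a sliding window, repeat with a second key, then take the transitive closure of all matched pairs) can be sketched as below. The record layout, the key functions and the `match` predicate are hypothetical. The closure is computed here with a union-find structure, which is one common way to do it and is an assumption rather than something stated in the text.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Set, Tuple

Record = Dict[str, str]

def window_pairs(records: Sequence[Record], key: Callable[[Record], str],
                 window: int, match: Callable[[Record, Record], bool]
                 ) -> Set[Tuple[int, int]]:
    """One pass: sort record indices by key, then compare each record with
    the window-1 records that precede it in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs

def transitive_closure(n: int, pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group record indices into equivalence classes using union-find."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]

def multipass(records, keys, window, match):
    """Union the matched pairs of every pass, then close them transitively."""
    pairs = set()
    for key in keys:
        pairs |= window_pairs(records, key, window, match)
    return transitive_closure(len(records), pairs)

records = [
    {"name": "Smith John", "acct": "1001"},
    {"name": "Smith John", "acct": "9999"},   # same name, mistyped account
    {"name": "Zmith Jon", "acct": "9999"},    # same account, mistyped name
    {"name": "Adams Ann", "acct": "5555"},
    {"name": "Taylor Bob", "acct": "7777"},
]
same = lambda a, b: a["name"] == b["name"] or a["acct"] == b["acct"]
groups = multipass(records, [lambda r: r["name"], lambda r: r["acct"]],
                   window=2, match=same)
print(groups)  # [[0, 1, 2]]
```

In this toy data, records 0 and 1 are paired only in the name pass and records 1 and 2 only in the account pass; records 0 and 2 are never compared directly and are linked solely by the closure step.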

However, the prior art fails to teach techniques that can be employed successfully for managing large databases of financial document image information.

It is thus an object of the invention to create a system, method and apparatus for creating and using a codebook of data images or features or relationships of financial instruments or standardized documents, scanning a negotiable or other instrument, producing an image, comparing the scanned image against the codebook, subtracting the codebook image information from the scanned image, compressing the left over handwritten or post-printed information, and taking the final reduced product and storing and/or processing same in accordance with banking procedures.

It is also an object of the invention to employ parallel processing techniques to accelerate the processing of an individual matching operation, comparison of an unknown object against a set of codebook templates in parallel, and the processing of a multiplicity of unknown images in parallel.

It is a further object of the present invention to substantially reduce the storage requirements and management of large archival storage of many check images and to improve the speed of accessing and retrieving individual check images and the long term storage requirements of older existing microfilmed check images, which are typically maintained by the banking system for about 7 years.

According to the present invention, a further aspect includes a method in which at least one of said comparing steps comprises applying a rule-based equational theory to the records.

It is an object of the present invention to provide a method including a step of eliminating all but one of any duplicate records from said database based on said transitive closure.

It is a still further object according to the present invention to provide a method in which the step of initially partitioning the records into clusters involves using a key extracted from the records.

A still further object of the invention provides that the step of computing a first key comprises scanning clusters of records in sequence, and for each scanned record extracting an n-attribute key, which is mapped into an n-dimensional cluster space.

Another object according to the present invention provides a method wherein the comparing step comprises comparing the records according to a characteristic selected from the group consisting of edit distance, phonetic distance and typewriter distance.

Another object according to the present invention provides for selecting a key from the group consisting of last name, first name, address, account number, social security number and telephone number.
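
Of the comparison characteristics named above, edit distance is the simplest to illustrate. The sketch below uses the standard Levenshtein dynamic program; the helper `keys_match` and its threshold of 2 are hypothetical choices for illustration, not values given in the text.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def keys_match(a: str, b: str, max_distance: int = 2) -> bool:
    """Treat two key values as a probable match when they are close."""
    return edit_distance(a.lower(), b.lower()) <= max_distance

print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("Smith", "Smyth"))     # 1
print(keys_match("Smith", "Smyth"))        # True
```

A predicate of this kind could serve as the per-field test inside the window comparison, applied to keys such as last name or address.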

It is another object of the present invention to provide variable-size or scaled check images retained on storage media, including decompression by utilizing codebook code to render full color and faithful reproductions of archived check images.

It is an object of the present invention to provide a method for identifying duplicate records in a database of financial document images, each record having at least one field and a plurality of keys, comprising the steps of sorting the records according to a criteria applied to a first key; comparing a number of consecutive sorted records to each other, wherein the number is less than a number of records in said database and identifying a first group of duplicate records; storing the identity of the first group; sorting the records according to a criteria applied to a second key; comparing a number of consecutive sorted records to each other, wherein the number is less than a number of records in said database and identifying a second group of duplicate records; storing the identity of the second group; and subjecting the union of the first and second groups to transitive closure. The selection of keys can be specific to particular backgrounds sought to be subtracted, thereby enabling background image subtraction to occur after sorting.

It is a further object according to the present invention to provide a method of merging two tables of records of financial document images, each record having a plurality of

Still another object according to the present invention provides a method further comprising the step of pre-processing the records in the database using a thesaurus database to indicate relatedness. The thesaurus database may include linked records indicating related names and nicknames in a plurality of languages. The preprocessing step may also include the step of employing a spell checker to correct misspellings in the records. The spell checker preferably includes the correct spellings of known cities, and is employed to correct the spelling in a city field of a record.

Another object according to the present invention provides a parallel processing method in which a separate processor is employed for comparing a predetermined number of sequential records sorted according to the first key to each other to determine if they match, and an additional processor is employed for the comparing a predetermined number of sequential records sorted according to the second key to each other to determine if they match. The database is preferably sorted in parallel using parallel merge sorting.

A further object according to the present invention provides a method, wherein: N is the number of records in the database, P is the number of processors, each processor p, 1 ≤ p ≤ P, being able to store M+w records, where w is the size of the merge phase window, and M is a blocking factor, P is less than N, MP is less than N, and r_i represents record i in a block, 0 ≤ i ≤ MP-1, comprising the steps of dividing
the sorted database into N/MP blocks; processing each of the N/MP blocks in turn by providing each processor p with records r_{(p-1)M}, ..., r_{pM-1}, ..., r_{pM+w-2}, for 1 ≤ p ≤ P, searching matching records independently at each processor using a window of size w; and repeating the processing step for the next block of records.

A still further object according to the present invention provides a method wherein N is the number of records in the database, P is the number of processors p, and C is the number of clusters to be formed per processor p, comprising the steps of dividing the range into CP subranges; assigning each processor C of the subranges; providing a coordinator processor which reads the database and sends each record to the appropriate processor; saving the received records at each processor in the proper local cluster and after the coordinator finishes reading and clustering the data among the processors, sorting and applying the window scanning method to the local clusters of each processor. The coordinator processor load balances the various processors using a simple longest processing time first strategy.

A further object according to the present invention is to provide an apparatus for identifying duplicate records in a database, each record having at least one field and a plurality of keys, comprising a storage medium for storing said records of the database; a connection system for selectively transmitting information from the database; and a processor having a memory, said processor receiving information from said connection system, for sorting the records according to a criteria applied to a first key; comparing a number of consecutive sorted records to each other, wherein said number is less than a number of records in said database and identifying a first group of duplicate records; storing the identity of said first group in said memory; sorting the records according to a criteria applied to a second key; comparing a number of consecutive sorted records to each other, wherein said number is less than a number of records in said database and identifying a second group of duplicate records; storing the identity of said second group in said memory; and subjecting the union of said first and second groups to transitive closure.

Further objects and features of the invention will become apparent from a review of the figures and detailed description of the preferred embodiments, set forth below.

BRIEF DESCRIPTION OF THE DRAWINGS

The preferred embodiments will be explained by way of the drawings, wherein:

FIG. 1 is a flow diagram of an overall methodology of a system operation in accordance with a first embodiment of the invention;

FIG. 2 is another flow diagram in accordance with a second embodiment of the invention;

FIG. 3 is a layout diagram of a typical check containing preprinted information;

FIG. 4 is a detailed flow diagram of the embodiment set forth in FIG. 2;

FIG. 5 is a sub-flow diagram of the portion "A" of the flow diagram of FIG. 4;

FIG. 6 is a sub-flow diagram of the portion "B" of the flow diagram of FIG. 4;

FIGS. 7A and 7B are two graphs of the percent correctly duplicated pairs for a 1,000,000 records database;

FIG. 8 is a graph of the percent incorrectly detected duplicated pairs for a 1,000,000 records database;

FIGS. 9A and 9B are two graphs of time results for the sorted-neighborhood and clustering methods on a single processor;

FIG. 10 is a graph of the ideal performance of the method according to the present invention;

FIGS. 11A and 11B are two graphs of time results for the sorted-neighborhood and clustering methods on a multiprocessor system;

FIG. 12 shows time results for the sorted-neighborhood and clustering methods for different size databases;

FIG. 13 is a flow chart representation of the steps involved in sorting, comparing and storing identifiers, in accordance with a preferred embodiment of the invention; and

FIG. 14 is a flow chart representation of the steps involved in creating a union of stored identifiers, and subjecting the union to transitive closure.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Financial Instruments

In FIG. 1, the overall system architecture of logic is set forth in accordance with a preferred embodiment of the present invention. Box 1 sets forth the first stage of the process, wherein the check (or other document or instrument, as the case may be) is scanned and a scanned image, reflecting the check, is created and stored. It is understood that any of a number of known hardware scanners can be utilized for this purpose, and any of a number of known software packages can be used to scan and store the image. The preferred scanner has a resolution of 600x600 dpi, 6 to 8 bits of gray scale or 18-24 bits of color information, with a selectively engageable increased resolution mode of around 2400x2400 dpi or higher for areas of increased detail. [illegible] resolution detail is expected in a portion of the image, as predicted by a matched template or model and for capture of, e.g., the signature. Thus, the high resolution scanning may be performed as a post-template matching operation, preferably using different scanning hardware. As a consequence of the limitations of current technology, such high resolution is not used to capture the entire image [illegible] because of slow [illegible] decreased reliability and sensitivity to artifacts of such scanners. As technology improves, it is presumed that the application of such technology will be included within the scope of the present invention. It should be appreciated that the techniques according to the present invention are independent of the resolution of the scanner and reproduction device, and can employ the most currently available technology based on the application demand.

Further, after enough scanning 1 is completed for an [illegible], the comparing 3 is next initiated wherein the image resulting from scanning 1 is compared against a codebook database, and a match 5 is found. By matching, it is understood that a mathematically significant overlap in image between the scanned image and the stored image is sought. In other words, identity is not necessarily required in order for the match to occur, but rather known techniques can be used to determine a best match. A match may be considered so long as the pixel range between the scanned image and the stored image is within some threshold lambda (λ). Of course, the matching system may also compensate for possible transforms, such as slight stretching, skewing, bowing or other alterations which would prevent an absolute match yet are possible under normal circumstances. A check image which can be matched, yet contains a significant difference from the
template is handled by an exception handler, possibly in a separate operation. Such exceptions, expected to be rare occurrences, may be considered manually, or be subjected to more extensive automated scrutiny.

Alternatively, matching can be performed by computing some feature of a patch or portion of the unknown image, and then matching this extracted feature against a codebook of templates. The templates themselves represent features and any other identifying information to allow for matching. An image feature is a defining characteristic of an image, which therefore may define different or distinguishing image types. Alternatively, one may design a plurality of distinguishing features of an image, each with some appropriately defined identifier, like an image type and at least one feature. The plurality of such features may be represented as a structured record associated with an image. Matching may be performed on this record of features or a portion thereof. Accordingly, the invention teaches the creation of a database having a plurality of records, wherein each record has unique elements comprising an identifying code and a collection of identifiers that distinguishes one record from another. Thus, the image type and feature can be compared against the identifying code and collection of identifiers in the database.

By way of example: natural features would correspond to certain obvious characteristics that a viewer of the image would recognize, e.g. color, grain, etc. Artificial features would be the result of a mathematical manipulation of the image, e.g. histograms of pixel luminance, amplitudes, and frequency spectra in some spatial region.

The matched codebook data image is next subtracted (subtract match 7 in FIG. 1) from the scanned image. By subtraction, it is understood that any of a number of techniques can be used to filter or compress the amount of information that comprises the stored image, and need not actually involve full and complete arithmetic subtraction. Further, the separation of the codebook data image from the scanned image may be totally separate from the codebook matching operation. For example, if a check background is light blue with a darker blue pattern, then the background may be separated by means of a simple color filter which is not dependent on the background pattern. Further, preprinted information may also be separated in like manner, by filtering for straight lines and other expected features or objects on a check.

Thereafter, a code is generated in element 9 identifying the matched image in the codebook. By generated, it is understood that within the codebook each stored template is represented by a code, which uniquely identifies the template. This code may be a binary integer or other type of index.

After matching the scanned image with a stored image, the separated foreground image information is stored linked to the codebook identifier of the separated background image. In this manner, the entire, original scanned image can be reproduced, while the storage of this image is significantly reduced in spatial storage requirements. The foreground image preferably undergoes further processing in order to reduce the information storage capacity necessary. This compression may be performed by further reference to a codebook, for printed information, or by other types of image compression for e.g. the handwritten information. All of the information contained in the check image is stored in linked fashion, which may be physically adjacent, or otherwise mathematically related.

FIG. 2 shows an alternative embodiment of the invention. The maker's check is scanned in by a scanning device 2, which may be a commercially available device for the purpose, and converted into a digital image stored in the processor's memory. The processor first scans the check image for the presence of an identifying code which uniquely identifies the check form and/or preprinted information on the check. The preferred code is a zebra code or two dimensional code. Optionally, the foreground image may be first separated from the background using standard image processing techniques before the code detection occurs. In such a case, the processor divides the check image into various regions, which may be overlapping, and begins to look at the various regions of the scanned image. The processor then separates the foreground from the background in the region of the image which it is processing. On a pixel-by-pixel or cluster of pixels level, the processor determines whether the part of the image that it is currently analyzing is part of the background 4 of the image or of the foreground 16 of the image. When the foreground obscures the background, the processor optionally interpolates the background, or generates a mask of the foreground image which is ignored or processed separately from unobscured background. Determining that the processor is looking at the background, it then checks the image for a zebra code 6 or some other such similar code as set up by the financial industry or check manufacturer. If there is no such code, the processor must compare the background to the database or codebook 8 to ascertain whether the check has a known background. Upon finding a match 10, the processor can subtract the background 14 from the image, replacing it with a code from the library. If the match is not exact, but within a certain range of matches, the background can be subtracted or filtered with the residual differences coded as additional information in the composite compressed image.

If no match is found, the processor preferably takes steps to update the adaptive database 12 to add this new background to the database and create a new code for it. The new background must be transmitted to a central repository so that a receiver of the coded image data may decode the background. Thus, upon encountering a new background which passes exception checks, the image is forwarded to a central clearinghouse for addition to the database and appropriate indexing and processing. This new check background image will be included in an update release of the database. Upon receipt of the new background, the clearinghouse first checks it against its own, more updated, database, and will return an appropriate code to the processor if one has been previously assigned, or assign a new code if none exists. In the absence of feedback, the new check will be stored as a compressed image according to known techniques.

After determining the background code, the processor can subtract off this now known background from the check image 14. If a zebra code was initially found 6, then the background can be immediately identified and subtracted off 14, replacing it with the code from the database library.

In the preferred embodiment, simultaneously with the comparison by the processor of the background for a match with templates in the database 8, a determination of a foreground area of the image 16 can be made by standard filtering techniques, with subsequent compression of this portion of the image area 18 using standard compression method techniques selected for this purpose. While the check image is being processed, all of the data of the check must be analyzed to detect indicia of fraud or irregularity. This preferably occurs after the background has been identified, so that the analysis may take into consideration the matched background image. This analysis may take place concurrently or subsequent to other processing of the
check. The foreground is checked for any fraudulent indications 20 such as an invalid signature or an improper amount. If any fraudulent indications are found, the check is rejected 22, and is subjected to exception handling, which may include, e.g. notification of the bank operator.

After the background data has been subtracted out and the foreground data has been compressed with no fraudulent indications, the remaining image is then transmitted 24 to either the payor's bank or an intermediary clearinghouse.

In another embodiment, the paper check is written either as a personal check or as a corporate check by the originator and given to the depositor for payment. A personal check is typically 36 square inches, printed on quality non-bonded paper with a weight between 20 and 28 pounds. The size of a corporate check will range, for example, from 20 to 60 square inches of the same quality as the personal check. Typically, the front of the check will have some sort of a background pattern whereas the back of the check may or may not have a pattern. The front pattern may be one of a repeatable nature, or of a random [illegible] back of the check is generally a [illegible] or repeatable pattern.

The different parts of a typical check are as annotated: 26 represents the payee's personal information, i.e. name, address, telephone number; 28 is the preprinted border surrounding the check; 30 is the preprinted word "Date"; 32 is the sequential number of the check (same number that is part of the magnetically printed information); 34 is the preprinted word "Dollars"; 36 is the preprinted word "Signature"; 38 is the preprinted word "Notes" or "Memo" or some other such notation indicating that this is the area to write a general comment concerning the nature of the check; 40 is the name and address of the bank that the check is issued against; 42 are the preprinted words "Pay to the Order of"; 44 is where the maker indicates to whom the bearer or depositor of the check is to be; 46 is the area where date of issuance is placed; 48 is the area where the maker writes in the courtesy figure (numerical amount in decimal) of the check; 50 is the area where the maker or makers of the check must sign the check; 52 is the area that the maker or bearer may write any memo or notes as to the nature of the check; 54 is the amount that is to be drawn from the check, written in alpha-numerical symbols; and 56 is the magnetic information of the bank, including the bank code, payor's account number, and check number, printed in OCR font (the so called MICR line). Elements 26 through 42 are all preprinted information supplied by the printing company. Elements 44 through 56 are areas on the check that will be filled in by the maker or bearer of the check. The date 46, courtesy amount 48, and the alpha-numeric check amount 54 can be typed in by the maker, whereas the signature 50 must be handwritten (or a stamp of a handwritten signature). The note or memo area 52 is optional and need not have anything in it. All other foreground areas should be filled in.

The check is then delivered to the depositor's bank for verification and processing. At this point, according to the present invention, the check is scanned into the system for pattern recognition and electronic processing 58. In order to insure completeness and enhance the check fraud detection,

a scanner or "scanning fax" interface. Data communications may be implemented through standard telecommunications interfaces, e.g. V.32, V.32bis, V.34, switched 64 DSU, fractional T1, T1, T3 or ISDN systems. While it is preferred that the check data be completely read in a single pass, in low production environments, the check may permissibly be scanned into the system multiple times to ensure that it is read in correctly. Alternatively, if the check is read in multiple passes, these may be performed in series, allowing high throughput. This initial complete imaging is important because once the scanning device determines that the check has been scanned in, the actual [illegible] document [illegible] once the electronic data is authenticated and verified, the paper check is destroyed, to [illegible] the check is [illegible] once.

Prior to, or after scanning [illegible] the scanner [illegible] magnetic [illegible] 56 [illegible] information such as the issuing bank, bank account number and number of the check.

[illegible] look at the scanned image of the check for "zebra" coding or any other type of identifying code on the check in order to identify the background of the check as a particular type 60. This will indicate what the background is and from which printing company or source the check originated. Since the entire image will be scanned in at about 300-600 dpi, the zebra code itself may be made unobtrusive to the naked eye. Further, the zebra code may be printed in a fluorescent or invisible ink, and scanned using a laser or other excitation source. The code is checked against a database of codes from the various banks to determine the type of background 62. Although the database can reside in any number of storage devices, the optimal distribution format for cost and ability to easily and inexpensively update is a CD-ROM.

If the database is stored on a media which requires physical access of stored data, then certain methods may be used to improve the apparent access time over the average or worst case access time. See, Lippman, A. et al., "Coding Image Sequences for Interactive Retrieval", Communications of the ACM, 32(7):852-861 (1989); Yu, C., et al., "Efficient Placement of Audio Data on Optical Disks for Real-Time Applications", Communications of the ACM, 32(7):852-861 (1989). In order to maintain the speed and efficiency of the system, the CD-ROM database must be segmented into various "bands", e.g. groups of database records which are accessible, from a starting position, more efficiently than database records outside the band. Thus, a number of the database records in the band may be accessed in a short period, or one of a predefined group of records need be accessed. The band is preferably a spatially defined region or series of regions of the storage media, but may be interspersed over a large area with an optimization for access based on a parameter other than spatial distance. In a preferred embodiment, a band includes background images or patterns which have similar characteristics, which are to be searched in a particular pattern matching operation. These patterns are stored on a rotating storage (e.g. magnetic, optical or magneto-optical) medium at a range or radii, e.g. on adjacent tracks of the storage drive. This arrangement

both the front and the back of the check need to be scanned 60 minimizes read head movement and repositioning during 

in. In addition, the reverse side of the check normally access. Likewise, in semiconductor memory, linked or 

includes the endorsements, which are part of the information related information may be stored in the same page frame to 

which must be captured from the check. Sophisticated increase retrieval efficiency. 

imaging systems, e.g. systems incorporating advanced If deciphering the bar code or zebra code requires refer- 

features. are not required for the scanning process. Rather, a 65 ence to a pattern template database, it is preferred that this 

standard stand-alone imaging system, including an Intel 486 database contain information which is banded together for 

or Pentium PC, CD-ROM drive (preferably 4x or 6x), with quick and easy access. 
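
By way of a non-limiting illustration, the banding of database records may be sketched as
follows. The sketch is an assumption-laden example and not the disclosed implementation:
the record fields (`code`, `characteristic`, `size`), the block-based layout and the
`head_travel` cost measure are hypothetical names introduced only to show how grouping
records with a shared characteristic into one contiguous band shortens read head movement.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateRecord:
    code: int            # codebook code of the background pattern (hypothetical field)
    characteristic: str  # banding key, e.g. dominant color (hypothetical field)
    size: int            # record size in storage blocks (hypothetical field)


def build_banded_layout(records):
    """Place records sharing a characteristic in one contiguous band.

    Returns (layout, bands): layout maps code -> (start_block, size) and
    bands maps characteristic -> (first_block, end_block_exclusive).
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec.characteristic].append(rec)

    layout, bands, cursor = {}, {}, 0
    for key in sorted(groups):
        band_start = cursor
        for rec in sorted(groups[key], key=lambda r: r.code):
            layout[rec.code] = (cursor, rec.size)
            cursor += rec.size
        bands[key] = (band_start, cursor)
    return layout, bands


def head_travel(layout, codes):
    """Total block distance skipped by the read head when fetching codes in order."""
    travel, position = 0, 0
    for code in codes:
        start, size = layout[code]
        travel += abs(start - position)
        position = start + size
    return travel


records = [
    TemplateRecord(1, "red", 4),
    TemplateRecord(2, "yellow", 4),
    TemplateRecord(3, "red", 4),
    TemplateRecord(4, "yellow", 4),
]
layout, bands = build_banded_layout(records)
# Unbanded reference layout: records stored simply in code order.
naive_layout = {1: (0, 4), 2: (4, 4), 3: (8, 4), 4: (12, 4)}
```

In this toy case, fetching the two "red" candidates from the banded layout requires no
skipping between records, whereas the code-ordered layout forces the head past an unrelated
record.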
In a preferred embodiment, an initial preprocessing of the scanned image determines a
characteristic of the scanned image. This characteristic corresponds to a group of check
templates representing the subset of check types which possess the given characteristic.
Therefore, after the image is processed for a characterization and identification of a
first characteristic, a database subset may be defined which includes the possible matches.
Of course, the preprocessing may produce an identification of a number of different
characteristics, each corresponding to a different subset of the database. This
preprocessing preferably occurs without reference to records stored in the database or only
to information stored in a cache for fast retrieval. This preprocessing simplifies the
pattern matching task which is required to positively identify the check. In the case of a
zebra code or other unique check identifier, this preprocessing completes the entire task
of identifying the check. However, even an extraction from the check form of the identity
of the manufacturer will greatly reduce the computational complexity of the task.

After a database subset of possible matches is defined, the possible matches must be
excluded or determined to be true; however, the early identification of a true match will,
of course, eliminate the need to exclude all others. Therefore, the images should be
compared in order of likelihood of a match. This likelihood may be derived from the
popularity of a given pattern, either nationally, locally or for a given system, a
statistical likelihood of a match based on parameters derived in the preprocessing phase, a
linkage of a check type or pattern to a customer's account or other criteria.

It should be noted that in a high throughput system as might be found in a large financial
institution, it is preferred that the database access not impede the throughput of the
scanning/recognition system, nor that computational resources be underutilized based on a
large disparity between peak processing power required and minimum utilized. Thus, after
the preprocessing, and during the database access phase, a second check preprocessing
operation may be commenced, while the first remains in a queue. Thus, since the storage
media access is generally slow, the processor or processor system, which may include a
plurality of processors in a symmetric or asymmetric parallel processing array, may divide
the check image compression task into a number of segments, each segment to be processed
when all of the necessary data is available, yet making the processing resource available
to other tasks when the necessary data is unavailable.

It is noted that, in the case of new check types or aberrations, a match will not be
possible from a published database. It is therefore preferred that, while the preprocessing
and image matching processing is in progress, the check image simultaneously be subject to
a compression algorithm that does not require an exact match. This image compression
preferably makes use of the results of the intermediate calculations and data structures,
if this is efficient, thereby reducing the additional computational load for performing
this alternate compression. Thus, the preprocessing characterizations and pattern matching
results may also be used to help provide a description of the check sufficient to fully
describe the information content thereof, albeit in a manner less efficient than if a code
referring to an exact match is found. However, under certain circumstances, a standard
compression method may be more efficient than employing intermediate results of the pattern
matching algorithm, and therefore is preferably employed. Because of the computational
complexity of the pattern matching, and therefore the large amount of processing power
which must be available to perform this operation (if zebra codes or other information
which simplify the pattern matching task are not available), the check image may be
compressed in a number of ways using these computational resources and the most efficient
result, e.g., smallest compressed data file, or alternatively, highest quality image,
retained. This image compression preferably occurs simultaneously with the pattern
matching, so that a finite limit may be placed on the time for the check processing
operation, and therefore a minimum throughput assured. It is preferred that the scanner
sub-system be the throughput limiting element, and that a data processing system be
employed which assures that the scanner is operating at peak speed. The processing
sub-system therefore should be selected in accordance with the scanning speed of the image
scanner.

When check styles are uniquely coded, the first step in the identification process is to
determine the identity of the code, and look up the code in a database 64. The background
can now be tagged with the codebook code from the database 70. If there is a match to the
database, a second test may be performed as a background task or as a delayed task. This
may be delayed while the necessary information becomes available to the processing
subsystem. This second test is to now look up the pattern, represented in the database,
based on the code and compare the entire scanned check with the image in the database 72.
This is done to ensure that the check that was scanned is not fraudulent in any way, or
that the code is not corrupted. This will also allow the printing company to add minor
variations to the background pattern that can generally only be seen or deciphered in the
scanned image, thereby providing a system for fraud detection and prevention which
cooperates with the check processing system. If no codebook entry is found in the database,
the operator may be notified 66 that this is a non-standard check, and processed as an
exception. As stated above, if the check is nonstandard, yet nevertheless appears genuine
according to fraud prevention criteria, then it may be compressed using a standard
compression algorithm, as described above. Further, an indication is made which will allow
this new check to be included in an image template database update, and to allow consistent
coding of this new style.

While this check image matching operation is being processed, the foreground image, which
includes the printed information on the check and the added, variable information which is
generally handwritten, is processed and compressed. The preprinted information may be
subject to the same type of model-based compression as the background image, in particular
for constant areas such as borders and designs, while certain portions, such as bank
identification, drawer identification, account number and the like may be subject to OCR
with identification of the text content and particular font. Finally, the particular
information relating to the monetary transfer embodied by the check must also be captured,
compressed and preferably analyzed for completeness, inconsistencies, possible indicia of
fraud and information content. Handwritten information may be compressed, for example, by
modeling the handwriting as a B-spline and storing the control points, which will generally
require fewer bits for storage than the raw data. Of course, other compression methods are
known in the art.

A preferred processing system includes a parallel or multi-processor system, which may
serve as a coprocessor for a standard-type personal computer. The processing system
preferably has its own mass storage interface, independent of a mass storage interface of
any host processor system. The preferred interface to a mass storage device is a SCSI or
SCSI-2 controller, although other types are
acceptable. The controller or operating system preferably executes predictive seeks and has
a large cache. Further, a slow CD-ROM may be shadowed onto a fast magnetic hard drive.

The foreground processing, many aspects of which may be performed in parallel, proceeds as
follows. The processing system performs an image processing operation which separates the
foreground from the background image. The foreground image is further decomposed into
primitives, such as lines, borders, line graphics, printed text and handwritten text. Each
such foreground graphic feature, as it is identified, may be subtracted from the remaining
image content 80. Alternatively, a constellation of foreground features that commonly
appear together, e.g. postprinted features, may be represented by a code, and thus be
represented in highly compressed fashion, proceeding in analogous fashion to the background
coding operation. Thus, the [illegible] border 28 that surrounds the check and subtract it
off of the scanned image. The foreground processing system [illegible] the bank
[illegible] that issued the check 40 and provide a code for that information. The bank
identifying information may then be separated from the remaining foreground image; however
it is preferred that the actual image indicating the bank identity be fully characterized
in the compressed image. Another process can determine the name and other personal
information of the payee 26, i.e. address and telephone number, if available, and store this
postprinted information as text, font and formatting information (page description
information). The image of this data is then separated from the foreground image.

Since it is preferred that the check scanning device be allowed to operate at peak speed,
and finite delays in information access and processing occur after the entire check is
scanned, the system will generally operate with a processing latency wherein a result is
not produced until some time after the scanning has occurred, and likely after a subsequent
check has been scanned. In such an instance, it is preferred that the processing system
globally optimize the processing based on the availability of data, which may result in
out-of-order processing of information. For example, if a first check does not include a
zebra code and a second check does, it is likely that the processing of the second check,
which requires less processing, will be completed before the first check, which requires
only simple processing, even if this delays the processing of the first check slightly.
Thus, once scanned, image data records will then be queued up for further processing when
the necessary data becomes available to the processor. Checks that have a unique
identifying code may be run first, before checks which require more extensive processing,
as they can then be compressed much quicker than the ones without any code. In one
embodiment according to the present invention, the processing system is not a dedicated
image processor, but rather a standard type personal computer or workstation. In this case,
the pattern matching delays may be very long, with in-process check images stored on
magnetic disk. In such a case, zebra coded checks will likely be processed in real time,
while those which must be matched without the assistance of precoding will be delayed, and
processed as necessary information or processing resources become available. This may
create a substantial backlog of unprocessed images, but this does not cause a problem in a
low production environment, where cost effectiveness is more important than high
productivity.

If there is no identifying or zebra code on the image to unequivocally identify the
background image, an image analysis is done on the check image in order to gain as much
information about the check as possible or as necessary. This will minimize the number of
times the database needs to be accessed, and the number of database records which need be
retrieved, in order to maintain the throughput of the system. The background, or a portion
thereof, is analyzed to determine if there is a repeatable pattern 76. If the pattern
segment repeats, the segment is classified as a background pattern for this check image 78,
at least for the area analyzed or for a portion of the image. As stated above, this
intermediate [illegible] used in an image compression system to define the background by
specifying the pattern and its spatially repeated characteristics. The image pattern,
however, continues to be processed, as a unique identification of the [illegible] will
[illegible] superior compression results. The identification of a background [illegible] a
subset of [illegible] with a corresponding background pattern. [illegible] for the
existence of a number of characteristics, the union of which will define a relatively small
number of possible matches in the database. [illegible] repeatable pattern segments in the
background, a primary key for searching, are preferably banded together on the storage
medium, which is preferably a CD-ROM or CD-ROM with a fast magnetic hard disk shadow, for
efficient accessing. If a second primary key is applied, a second storage subsystem with
bands corresponding to the second characteristic is preferably employed. A second key may
be, e.g., color, lithographic technique, paper quality or type, etc. The pattern segments
are compared to those in the subset of the database for a match. The comparison is
preferably done via a rule-based expert system to determine if any conflicts exist (such as
extraneous or missing data). Otherwise, the matching is done using a pattern recognition
algorithm known in the art. The rule-based system will typically follow the production
system model consisting of a set of rules and a database of facts. This can be done with
the use of the multi-processor and distributive processing hardware. If there are no
conflicts, i.e. the background image matches one of the patterns in the database, the
background of the image can be replaced with the code of the pattern from the database
codebook 110. If the match is not exact to one in the database, but is within certain
prescribed tolerances (Δ) 112, the background can still be replaced with the code from the
database codebook 114, but the error or deviance from the exact match is preferably
compressed and included in the composite compressed image 116. Once the background image
has been matched, the process to filter out the background from the image can be batch
processed with checks of the same background either in turn or in parallel.

Because of the relatively slow access and data transfer rates of a CD-ROM, the preferred
distribution media for the image database, the read head should be predictively positioned
near the data to be retrieved, so that the distance of head movement between accesses is
minimized to the extent possible. Further, the number of accesses and the amount of data
that needs to be transferred is preferably minimized.

The pattern recognition algorithms, for extracting a background pattern from the check,
preferably use a tuned filter (e.g. a spatial image processing filter which is selective
for a pattern or class of patterns), for the similar characteristics of the background such
as image and colors 102. A subset of the database is defined to minimize the number of
accesses to the database 104 and the database is searched accordingly 106.

A rule based system analyzes the image for conflicts in the pattern extraction system, and
attempts to resolve them. If
resolved, the new pattern is then analyzed to determine if it is a repeating pattern. A
repeating pattern will define a subset of images in the database. If the background does
not contain a repeating pattern, then it is analyzed for the presence of other
characteristics, which may be found on checks which do not have repeating patterns. This
may require analysis of the entire check, or the characteristic may be extracted from a
portion of the check.

The matching [illegible] there is a partial match. If there is a partial match to more than
one image 118, a composite description is created 120 describing how the check is a
composite of the multiple stored images and what (if any) are the differences 122. This
method of compression will only be employed if the resultant compressed image is smaller
than a file created using other methods. If there is not a complete match, i.e. not all of
the conflicts can be resolved, the background will be compressed and an error analysis will
be appended describing what are the unresolved conflicts.

Thus, the present system provides a variable size compressed image corresponding to the
check image, with the size of the image varying from a simple series of codes which defines
the check, including a [illegible] compressed image of the handwritten and post-printed
information using a model-based compression approach, to a full image compression
[illegible] any check which is of particularly poor quality be manually processed, as it is
likely to require manual analysis in any case and a highly compressed image may tend to
lose important data, especially in a poor quality original. Thus, the ultimate exception
handling sequence includes operator intervention.

If a pattern is identified, it is then compared to a stored set of libraries of backgrounds
to determine if the background is of a standard type, e.g. a type by which entries in the
database are indexed. The libraries will consist of raw check images, and may also include
characterization data relating to those images. If it is desired to reduce the size of the
image database, then these images themselves may be compressed by a model-based compression
algorithm which defines an image with respect to simpler graphic primitives. Thus, although
preferred, it is not necessary to store entire backgrounds of the standard checks since the
background is normally of a repetitive nature. A copy of the database preferably is
tailored to fit on a single CD-ROM or other integral storage unit. Because of banding
requirements, if space permits an image of a check may appear in multiple positions in the
database; however the identifying data should remain the same for the image regardless of
which copy is recognized as a match.

Compressed storage of the image template or model data of the database is preferred because
it reduces the penalty of the slow data transfer rate of a CD-ROM, and allows quicker
access of related image data. Compressed storage also increases effective storage capacity
of the CD-ROM.

It is only necessary to maintain in the library enough information to recognize the
repetitive nature of the background. The background information stored in the database may
be in a standard compressed format. If the format is not lossless, then an error analysis
mechanism is provided to ensure a match. When particular information is retrieved, it can
be decompressed for the pattern recognition and analysis. On the other hand, depending on
the type of compression employed for data in the database and the requirements of the
pattern recognition system, the image data may be analyzed without decompression or with
only partial decompression. In such a case, it is preferred that the compression format be
compatible with the pattern recognition and analysis format.

The libraries in the database are preferably correlated into a codebook of standard check
background images. These libraries are based on certain groupings or bands of information.
These bands will consist of similar colors or types of patterns. In order to maintain the
efficiency of the system, when it is determined that the background pattern is within a
certain band on the CD-ROM with particular characteristics, the contents of the band may be
predictively transferred to a cache for prospective analysis. Preferably, the rate of
downloading of data will meet or exceed the processing speed of the matching process;
however, if the data access is slower, then the system may perform other tasks until the
necessary data becomes available. Other libraries in the database include, for example,
image features of the checks, account numbers or other identifying information with known
backgrounds, account numbers or other identifying information with unknown backgrounds,
signatures and other handwritten information or specialized backgrounds, e.g. computer
screen, stock certificates or corporate [illegible]. These can be linked together in
[illegible], depending upon the priority of the list of features or information.

In the preferred embodiment according to the present [illegible] two (2) seconds for one
[illegible] to be scanned [illegible] the next scanning operation commences, [illegible]
for completion of processing is possible. The processing [illegible], while preferably
sustainable indefinitely, should be sustainable for at least one hour. If within the time
constraints, no match to the database is found, the pattern matching process will be
terminated and the entire image will be compressed using standard compression techniques.

Such standard techniques include compression by defining a set of Fourier or wavelet
coefficients to describe the image with a reduction in raw data. In addition, the new
images may be added to the library, as an addendum in a temporary storage medium, such as
RAM, local magnetic hard disk, EEPROM or flash memory. These added data entries preferably
include the same indexing as the database itself, though the addenda may be processed
separately and by different criteria, i.e. since these are stored on a more accessible
media, the matching need not be optimized for the presumed slow access of the main data
store. If these unrecognized checks meet the criteria established, they are included in an
update of the distributed database. If the system is [illegible], the new image may also be
sent to a clearinghouse or the other banking sites so that the other libraries may be
updated on a real-time basis. These may also be updated at a later time.

Regarding the detection of the background image of an unknown inputted document image, one
may elect to compute some feature analysis during or immediately following scanning to
accelerate the task of determining the background from the codebook database. As an
example, a color scanner may be designed to set certain mask bits indicating a scanned
document has some particular color, e.g. red. The bits set during scanning may be used to
direct further processing of unknown input images, e.g., the red color bit is set during
scanning, but the yellow color bit is not set. The document image can then be further
processed with other images containing red in bulk fashion, but completely separate and
independent from those document images containing yellow.
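
By way of a non-limiting illustration, the use of scanner-set color mask bits to direct
document images into separate bulk-processing batches may be sketched as follows. This is a
minimal sketch under stated assumptions and not the disclosed implementation: the bit
assignments in `ColorBits`, the rule format (required bits, excluded bits), the codebook
names and the "unrouted" exception batch are hypothetical and introduced only for
illustration.

```python
from collections import defaultdict
from enum import IntFlag


class ColorBits(IntFlag):
    # Hypothetical bit assignments; an actual scanner would define its own mask.
    RED = 1
    YELLOW = 2
    BLUE = 4


def select_codebook(mask, codebooks):
    """Return the first codebook whose required bits are set and excluded bits are clear."""
    for name, (required, excluded) in codebooks.items():
        if (mask & required) == required and not (mask & excluded):
            return name
    return "unrouted"  # no rule applies; left for exception handling


def batch_by_codebook(scans, codebooks):
    """Group (document id, mask) pairs so each batch is tested against one codebook."""
    batches = defaultdict(list)
    for doc_id, mask in scans:
        batches[select_codebook(mask, codebooks)].append(doc_id)
    return dict(batches)


# Assumed rules: "red" takes images with red but not yellow; "yellow" takes any
# image with the yellow bit set.
codebooks = {
    "red": (ColorBits.RED, ColorBits.YELLOW),
    "yellow": (ColorBits.YELLOW, ColorBits(0)),
}
scans = [
    ("chk-1", ColorBits.RED),
    ("chk-2", ColorBits.YELLOW),
    ("chk-3", ColorBits.RED | ColorBits.BLUE),
    ("chk-4", ColorBits.BLUE),
]
batches = batch_by_codebook(scans, codebooks)
```

Because the routing decision uses only bits already set during scanning, it requires no
database access, and each resulting batch may be handed to a separate processor.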
The foregoing implies that a hierarchical codebook organization is possible. Several
independent and separate codebook databases can be organized whereby members of the same
codebook share perhaps some common features. Continuing the example, one may elect to have
a "yellow" codebook, i.e. background templates that contain some yellow are "clustered"
together in one logical codebook, and thus only unknown input images with a yellow bit set
need be tested against this codebook. Concurrently, there may also be a "red" codebook
against which to process unknown input images with red, but not yellow.

The approach is known in the art and many different features can be used for this
organization. The intent is to divide "feature space" into separate domains of the database
that can be inspected in parallel. The computational benefit is the reduction of the size
of a codebook that is searched, and the parallel processing of multiple unknown input
images.

By way of example, suppose there are 100 backgrounds, 50 of which have yellow but no red,
and 50 of which have red but no yellow. Then, given an unknown input image of a document,
known to have yellow, the document would be compared to at most 50 backgrounds, rather than
at most 100 backgrounds. Therefore, the computational complexity of the search has been
reduced in the worst case by halving the size of the codebook database searched. In
parallel, a search of the red codebook for an unknown image will cause the same reduction
in computational complexity. And since both the red and yellow documents may be processed
in parallel by two separate parallel processors, the obvious reduction in time is
multiplied.

Now, however, designing a means of forming hierarchical codebooks by further dividing the
yellow codebook into multiple, smaller, disjoint or slightly overlapping codebooks, e.g. a
yellow and blue but no purple codebook, or a yellow and purple but no blue codebook, is
possible.

Further, the hierarchically organized codebooks define a logical decision tree or indexing
structure for searching a database of templates. This means that the processing of an
unknown input image on a single processor can be accomplished with hierarchical codebook
organizations simply by using the organization as a search tree to find relevant templates
for comparison to the input image. Thus, the organization serves as a single database
indexing structure, or a decision tree, or a means of parallel processing multiple images
against separate collections of templates.

Producing several codebooks, or possibly hierarchical codebook organizations, requires a
means of dividing the set of background image templates into subsets that contain related
backgrounds, according to some criteria over feature space, while the subsets are disjoint,
i.e. two members of two distinct subsets have distinct feature values.

Such organizations of data and databases are commonly achieved by various clustering
algorithms. Clustering algorithms are generally iterative optimization programs that in
piecemeal fashion compute averages over some feature domain of individual objects, i.e.
document image background templates, in such a way as to minimize global optimization
criteria, e.g., the minimum squared-error criterion. Those skilled in the art will
understand that such programs approximate maximum likelihood estimates over the means.

It is sought to have easily detectable features that can cluster backgrounds into related
subsets of hierarchically organized codebooks. It should be further understood that any
number of possible features can be used, for example statistical measures, spatial domain
features, frequency domain features, relationships between objects or other features.

One can further define a centroid template of each distinguished subset that may be used to
detect early in the process which codebook to search given some unknown image. The unknown
image is first compared to a set of centroid templates that embody perhaps a number of
feature values, and the closest matching centroid indicates that the background for the
image is to be found in the codebook that it represents. The centroid may be computed by a
clustering program that computes the means of the feature values of the members of the
codebook formed by clustering. Such centroids represent mean feature values of some sort
that are chosen or computed by some clustering program to accurately distinguish the
unknown background information.

The clustering program itself may be viewed as an offline process performed to initialize a
set of codebooks or hierarchically organized codebooks. As such, it may be executed in
parallel for efficiency in speed either to initialize codebooks, or to update existing
codebooks, or for entire reorganization of existing codebooks.

The manner in which clustering programs can be executed in parallel is similar to the
manner in which [illegible] are compared to or searched in a codebook of templates. Here, a
distinct template, t, is iteratively compared to all previously processed templates
presently residing in some codebook, C_i, where i is greater than 1. Some distance
function, d, is computed between t and every codebook by computing the distance of t from
some member of C_i or a centroid representing C_i. That codebook C_j found to have some
minimum distance is selected for membership by t, i.e., t is assigned to C_j, and C_j's
average feature values are updated to reflect the incorporation of t in its subset.

Further processing of the selected subset is possible, for example recomputation of some
global mean or statistic for all present members of the selected codebook. Such recomputed
statistics, like the mean, can be used to represent the centroid template of the subset
used for further processing of additional template backgrounds. The comparison of t to
existing codebooks can be performed in parallel. The computation of statistics for a
distinct codebook, like the mean values of some feature, can likewise be performed in
parallel.

As part of the parallel processing, the foreground information can also be compressed. As
noted above, the bank information and the border can be truncated and replaced with codes.
Any other preprinted information that is supplied by the printer, such as "Pay to the Order
of" 42, "Memo" 38, "Signature" 36, "Dollars" 34, "Date" 30, check number 32, digitalized
code 56 or any other information that is printed on the check as part of the standard
printing information, is either part of, for example, the zebra code or 2-dimensional
coding and can be truncated, or can be replaced with standard text characters with font and
formatting information for standard compression 38.

The scanned data is now compressed by deleting the background pattern and the preprinted
information, leaving only the personalized information from the check along with the code
for the background pattern and the preprinted, foreground information. As with the data
recognition of the background information, the personalized portions will work the same
with the use of the pattern directed inference system and the rule based algorithms.

In one embodiment according to the present invention, the payor's signature on the check 50
is verified for authenticity
by comparing it with a database of signatures 84 including
a representation of the signature of the drawer. If the
signature does not match a corresponding signature in the
database 86, the bank operator needs to be informed for
manual verification and of the possibility of a fraudulent
check 88.

In the preferred [illegible] will be
checked against the numerical value written 48 to ensure
that they are the same and to ensure that there is a sufficiency
of funds in the payor's account. The date 46 is determined
from the written information 90 and is then verified to
safeguard against a stale check (normally one more than
three months to a year old, depending upon the type of
check) or a check that has been post-dated, i.e. where the
date on the check is later than the actual day of deposit 92.
The bank operator is preferably informed of any problems
with the dating of the check 94. Endorsements on the back
side of the check are also preferably examined for accuracy,
fraud and authentication by comparing them against the stored
signatures in the database.

If the foreground data is handwritten, it can be further
reduced by employing the model-based algorithm of
B-spline 98. This will compress the written characters to
control points. In order to save space on the database
medium, the preferred method will only have the B-spline
control points stored for comparison of the payor's signature
and of the endorser's signature. If the data is typed or printed
on the check, such as for the check amount, or the courtesy
figure, the processor need only determine the font and
formatting characteristics of the information 100. The image
portions of this information can now also be eliminated,
further compressing the data to be transmitted and stored.

In order for the banking system to maintain security and
transmission verification, encoding and encryption are
preferably employed on the compressed data before transmitting
to the other sites or to the clearinghouse. A protocol is
established between the connecting facilities. The
compressed check image may be encoded, e.g. given a
transaction identifier, based on the number of transactions done that
day or the number of checks that were scanned, or some
standard form of encryption coding. The transmitting
facility must request authorization from the receiving facility to
send the compressed data.

The compressed data is now transmitted to the central
storage device via a telecommunications system, e.g.
interact system. The receiving facility would either be the payor
bank or a clearinghouse. The compressed image is
preferably archived at at least one of these institutions. If it is
transmitted to the clearinghouse, it must then be transmitted
again to the payor bank. After the data is transmitted, the
transmitting facility needs to know that the transmission was
valid and that the data was received correctly by the
receiving facility. Once this is accomplished, the transmitting
facility will either erase the compressed data from its storage
or alter it indicating that the image has been transmitted to
another facility. This will keep the system integrity secure
and will ensure that the check is processed only once. If
there are to be multiple transmissions of the data, e.g. between
the payee bank and the clearinghouse and then between the
clearinghouse and the payor bank, there is preferably a
system including multiple levels of encryption. In this
instance, the payee bank will put on two levels of encoding.
The first one is to be decoded by the clearinghouse and the
second one is to be decoded by the payor bank. The
clearinghouse is in this sense acting as a controller of the
data and does not need to actually know what the data is. All
the clearinghouse needs to know is where the data is to be
sent. This information can be built into the outside
encryption method. If there are not the proper levels of encryption,
or the codes are not correct, the transmission is deemed not
valid and a retry is requested. If the encryption is still not
correct, the receiving facility will be so notified as to a
possible fraudulent transmission.

After completing the correct protocol with the
transmitting facility and establishing a valid transmission to the
[illegible] account will be settled and a
statement will be sent to the originator. Otherwise an error
will be sent to the [illegible] operator indicating that
there is a problem with the check.

Instead of electronically transmitting the compressed
image of the check, another method is to download the
[illegible] onto magnetic media and transport the media via
conventional physical transfer mechanisms. This may be
done instead of setting up a secured and/or trusted series of
network communication and still obtaining some of the
benefits of compressing the data. Bulk transport of the mass
of [illegible] physical paper checks is no longer necessary and
is replaced with [illegible] on magnetic media. Such physical
transport may also be employed in case of failure of the
telecommunications network.

Another embodiment according to the present invention
includes a scanner that produces multiple images at different
resolutions and gradient scales. This way there is a greater
probability of being able to match the background
information with the information stored in the database. This will aid
the matching of the models in the database to the scanned
image. As discussed above, an ultra-high resolution
scanning may be required to image and verify certain anti-fraud
lithography on the check.

As discussed above, images of text may be subjected to
optical character recognition (OCR) for decomposition into
text, font and formatting information. The present invention
also allows the use of other presently available or future
options, and is thus compatible with OCR data conversion
(even if done manually) or any other type of commercial
scanning process. Thus, after the check image is scanned,
the data is available for any type of analysis, including the
present model-based pattern recognition and extraction
techniques. This compatibility with standard OCR systems and
the like is especially helpful for post-printed, typed
information such as the alpha-numeric amount, courtesy amount
or even the signature (if stamped). This information may be
captured from the various subsystems and can then all be
reduced to codes in the codebook. The decompression and
regeneration of the checks would then be of a cleaner
version than the original.

Likewise, the image data remaining after elimination of
the background image information may be further reduced
by going through standard compression algorithms, as for
example JPEG in order to eliminate any white space or other
redundancies. See Wallace, G. K., "The JPEG Still Picture
Compression Standard", Communications of the ACM,
34(4):30-44 (1991). Thus, the present invention may be
applied as an open system, which is compatible and works
in tandem with existing related systems. Other image
compression methods are also known, including ABIC, U4, and
run length encoding.

Along with doing parallel operations for the background
and various foreground data checks via a multi-processor
system, the searching for the background code and/or the
scanned image pattern can be broken down and searched for
over a network of processors, which may be loosely or
tightly coupled. See Tanenbaum, Modern Operating
Systems, pp. 3, 362-394 for a discussion of distributed
process operating systems. In this case, it is preferred that
each processor maintain locally in fast memory at least a
related sub-set of the total database. The scanned image is
broadcast over a secured network to a number of processor
sites and each site can search its portion of the total database
in a fraction of the time it would take to access the entire
database as a whole. The different databases would have
different criteria to search for, e.g. different compressed
images, different scales of scanned images or different
search algorithms. Once a match is found, a broadcast
message goes out to the other processors in the network to
cease the search and the matching code is sent to the original
site that initiated the search. Otherwise, a "not found"
indication is sent across the network, after the slowest
processor completes its necessary tasks. This method greatly
reduces search time but there is a cost in order to
communicate over a secured network between the separate
processor sites. If these sites are remote, then an encryption
security method is preferably employed, similar to those
described above with respect to transmission to the
clearinghouse or other banks.

Another embodiment according to the present invention
includes a system in which the image to be matched does not
come from a paper scanning device. Instead, the image is
generated directly in the computer such as from a computer
graphics package. This can be done from a drawing that was
originally created or altered on a computer. Of course, the
most efficient compression of this data is identifying
information on how it was created and the customized
information. If such information is not available, then the pixel
image is processed similarly to the model-based image
pattern recognition and compression system described
above.

In general, when a code for the background information
is found in the database and the code then replaces the actual
background in the subtraction and compression method, the
check can be reconstructed with different scales, resolutions
and sizes, depending upon the requirements of the requestor
of the copy. Further, when the compression system is
broadly viewed as being for optimal compression of any of
a number of types of image data, the ability to reproduce the
entire image or portions thereof with varying resolution is
useful. For example, if the scanner operates at 600 dpi, with
portions scanned at 2400 dpi, it is preferred that a copy be
produced at 300 dpi, a standard laser printer resolution,
unless this additional information is necessary. Likewise, if
the image is scanned at 600 dpi, while a 2400 dpi printer is
being employed, the matched background may be printed at
the full 2400 dpi resolution, derived from an exemplar
check, rather than the 600 dpi as scanned. If handwritten
information is stored as B-spline control points, this data
may also be scaled in printer resolution manner and output
at the maximum resolution available. Further, if desired,
signature information may be output at maximum
resolution, while other information is output at lower
resolution. The present system also allows reproduction of only a
portion of the image, e.g. the foreground, with elimination of
the background. This may be employed to provide a
customer with a monthly statement where no irregularities are
suspected. The present scalable output system also allows
"cameo" representations to be presented.

The present variable resolution reproduction allows for a
high resolution computer output microfilm (COM) to be
produced from the compressed record, if such a backup is
desired. This system may actually allow a higher resolution
image to be stored than by standard microfilming or
microfiching techniques, and simplifies the elaborate production
necessary in order to directly microfilm checks.

[illegible] example relates to a further adaptation of the
[illegible] compression [illegible] storage [illegible] for reducing
[illegible] improving security in banking practice. In this
instance, a [illegible] camera [illegible] of a person
[illegible] a check. Information relating to
the person is separated from the background, and the image
[illegible] may then be appended
to the [illegible]
cashing the check should this [illegible] issue. Alternatively, data
relating to a fingerprint of the person cashing the check may
be captured, compressed or analyzed, and appended to the
[illegible] data [illegible]. This concept could thus be used for storing
and [illegible] identification information for law
enforcement.

A video analysis system may also be used, in conjunction
with a model based analyzer to efficiently generate a detailed
representation of an individual from a series of video frames
or from a scanned photograph. This may be done in
conjunction with a type of identification card with a person's
picture on it. Rotational algorithms may need to be
implemented in order to normalize [illegible] to a frontal
orientation. The image may also be checked in real time against a
database of [illegible]. This verification may be
by automated methods or by presenting the stored and live
image to an operator to verify. Therefore, the person cashing
the check or document can be immediately verified as the
person to whom the document is made out.

An efficient method of identifying the bank customer who
cashes the check is, if the person's image is present in the
database, to append a code to the financial transaction record
including the check information as well as an index code to
an image of the person in the bank's database. Information
relating to the check's casher only needs to be stored long
enough for the check to be cleared through the system and
for the originator to be notified of the check's disposition.
Once that takes place and there is no question as to the
casher of the check, the compressed image can be removed
and only the original information on the check needs to be
stored for the required length of time as per banking
procedures. If the person's video image is not in the data base,
then the entire compressed image (with the background of
the bank removed) may be sent, or retained by the bank, with
the check for verification of the payee until the check is
cleared.

As an alternative to video camera images or a scanned
photograph image, a fingerprint of the payee can be taken
when the payee goes to the payee bank to cash or deposit the
check. The fingerprint image is to be scanned with a special
scanner, which may include a laser or other mechanism
known in the art. The resulting fingerprint image is
compressed using known techniques, and matched to a
fingerprint identification database. If there is no match, then the
fingerprint is stored in compressed form by known methods
and the compressed image is sent with the compressed check
image. As with the video image, the fingerprint image need
not remain with the compressed check image for the
required length of time as per banking standards, but only
long enough for the payee to be verified as the intended
payee. This is an added measure of security against fraud.

A further embodiment of the image pattern recognition
system according to the present invention is to implement
the system using optical computing techniques. An optical
image correlator may be used to identify the check
background and produce a transformed residual image, which
may then be further compressed. The optical image
correlator may be an electro-optic device operating on scanned
image data, which is projected by laser beam from a
modulated light shutter onto a holographic crystal onto
which a plurality of latent background images have been
stored. When a background pattern matches one of the
stored patterns, the resulting image will differ, in a manner
indicative of which image is recognized, from an
unrecognized image. Further, the error signal after the optical
correlator represents the correlator error, which is the
foreground signal and any noise.

A full optical system is also possible, in which the check
is illuminated by laser, and the reflected image is then
subjected to an optical correlator which includes latent
images of a plurality of check background images. The laser
light passes through the crystal, and produces a pattern,
which is detected by an electronic image sensor, e.g. CCD
or CID array, which represents a background identification
(if any) and a residual which represents the foreground and
noise.

In a further embodiment, a check is imaged by a 300 dpi
line imaging scanner, with a scan width of about 4.5 inches,
and which acquires a color image in a single pass. This
scanner is arranged to obtain an image of both sides of the
check, by sequentially imaging both sides with a single
imaging scanner, or providing two opposed imaging
scanners. The image is obtained with a 6 bit scale for each color,
thereby providing 18 bit color. The scanner has a motorized
feed, which advances the check through the scanner head. A
feeder may be provided, although checks may be fed
manually.

The output of the scanner is fed into a data port of a
computer system. The computer system includes a central
processor, which is, e.g., an Intel 486 class processor
(DX2-33, DX-50, DX4-75, DX4-100, etc.) with primary cache,
with a 256 kByte secondary cache, and 8-16 MBytes RAM.
The processor system is interfaced to a hard disk drive
having 250-1000 MBytes of storage. The processor system
is also interfaced to a double speed CD-ROM device. The
scanner and CD-ROM are preferably interfaced to the
processor system by way of a SCSI interface. The hard drive
may be either a SCSI or IDE device. The processor system
also has a floppy disk drive and a modem.

In operation, a check is scanned by the scanner. The
information is transferred to the processor system through
the SCSI interface. The entire image is buffered in RAM.
The check image is first analyzed for the appearance of an
identifying code, such as a zebra code. If this is not
identified, the image is then analyzed for information
identifying the manufacturer or style of the check. If this
identifying information is not located, a portion of the check
background is then extracted and analyzed. Certain
characteristics are then identified in this portion. A database,
resident on the CD-ROM, which is transferred to the hard
drive and buffered in RAM, is searched to identify
backgrounds contained on the CD-ROM which are possible
matches for the unidentified check. These possible matches
are ranked according to ease of access on the CD-ROM and
likelihood of match, and are sequentially retrieved from the
CD-ROM and compared with the unidentified image, until
a match is found. If no match is found, the unidentified
image is compressed according to known methods.

If a match is found, the matched background is subtracted
from the foreground image. The foreground image is
compressed by identification and characterization of printed
information, and B-spline decomposition of handwritten
information. The various compressed information is
concatenated into a single data file, and encrypted. The encrypted
file is stored on the hard drive for later transmission by
physical (floppy disk) or electronic transmission.

In yet another embodiment, a check is imaged by a 600
dpi two-sided line imaging scanner, with a scan width of
about 4.5 inches, and which acquires a color image in a
single pass. The image is obtained with an 8 bit scale for each
color, thereby providing 24 bit color. The scanner has an
automated feed, which sequentially advances the checks
through the scanner head.

The scanner is interfaced by way of a SCSI interface to a
computer system. The computer system includes a central
processor, which is, e.g., an Intel Pentium class processor
with a [illegible] kByte secondary cache, and 16 MBytes RAM.
The processor system is interfaced to a hard disk drive
having i_ 2 GBytes of storage. The processor system is also
interfaced to a quad or higher speed CD-ROM device. The
hard drive and CD-ROM are preferably interfaced to the
processor system by way of a SCSI interface. The processor
system also includes a digital signal processor or dedicated
image processor as a coprocessor. The processor system also
has a floppy disk drive and a modem.

In operation, both sides of a check are scanned by the
scanner. The information is transferred to the processor
system through the SCSI interface. The entire image is
buffered in RAM. The check image is first analyzed for the
appearance of an identifying code, such as a zebra code. If
this is not identified, the image is then analyzed for
information identifying the manufacturer or style of the check. If
this identifying information is not located, a portion of the
check background is then extracted and analyzed. Certain
characteristics are then identified in this portion. A database,
previously transferred from the CD-ROM to the hard drive,
is buffered in RAM and is searched to identify known
backgrounds which are possible matches for the unidentified
check. These possible matches are ranked according to ease
of access on the CD-ROM and likelihood of match, and are
sequentially retrieved from the hard drive and compared
with the unidentified image, until a match is found in the
DSP or image processing coprocessor. If no match is found,
the unidentified image is compressed by the coprocessor
according to known methods.

If a match is found, the matched background is filtered
from the foreground image. The foreground image is
compressed by identification and characterization of printed
information, and B-spline decomposition of handwritten
information by the coprocessor. The various compressed
information is transferred to the primary processor and
concatenated into a single data file, and encrypted. The
encrypted file is stored on the hard drive for later
transmission by secure electronic transmission.

In still another embodiment, a check is initially scanned by
a spatial frequency detector, which detects a maximum level
of spatial detail of the front and back of a check to be
scanned. The check is then imaged by a 600 dpi two-sided
imaging scanner, with a scan width of about 4.5 inches, and
which acquires a color image in a single pass. The image is
obtained with an 8 bit scale for each color, thereby providing
24 bit color. Areas which are identified by the spatial
frequency detector as having a spatial frequency of greater
than about 200 per inch are then scanned by a second
scanning system at a resolution of about 2400 dpi. The
scanner has an automated feed, which sequentially advances
the checks through the scanner head. The scanner is included
in a system which can store images of a large number, i.e.
greater than about 100, check images, including the high
resolution image.

The scanner is interfaced by way of a SCSI interface or a
high speed link, e.g., FDDI, ATM, 100base VG, etc., to a
computer system. The computer system includes a closely
linked parallel processing network, each node including a
central processor, which is, e.g., an Intel Pentium class
processor with a 256 kByte secondary cache, and 16 MBytes
RAM, and a CD-ROM drive.
Database Management 

THE MERGE/PURGE PROBLEM 
The present task relates to the merging of two or more
databases containing financial instrument image data, or
tables within databases, with potentially many hundreds of
millions of records. For the sake of discussion, let us assume
that each record of the database represents information about
employees and thus contains, e.g., social security numbers,
a single name field, and an address field as well as other
significant information. Numerous errors in the contents of
the records are possible, and frequently encountered. For
example, names may be routinely misspelled, parts may be
missing, and salutations and nicknames may at times be
included in the same field. In addition, employees that are the
subject of the listing may move or marry, thus increasing the
variability of their associated records. Table 1 displays
records with such errors that may commonly be found in
mailing lists for junk mail, for example.

There are two fundamental problems with performing a
merge/purge procedure. First, the size of the data sets
involved is so large that only a small portion of the database
can reside in the processor main memory (RAM) at any
point in time. Thus, the database resides on external store
(e.g., magnetic media) and any algorithm employed must be
efficient, requiring as few passes over the full data set as
possible. Quadratic time algorithms are not feasible in this
environment. Second, the incoming new data has a statistical
likelihood of corruption, from either purposeful or
accidental events, and thus the identification of matching data
requires complex tests. Simple structural matching
operations (i.e., one field "equals" another) are not possible in all
cases. Furthermore, the inference that two data items
represent the same domain entity may depend upon
considerable knowledge of the task domain. This knowledge depends
on the particular application and is available to those skilled
in the art working with the database.
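The limits of simple structural matching can be illustrated with a minimal sketch. It assumes Python's standard difflib module as a stand-in similarity measure and a hypothetical threshold of 0.8; neither is specified in the present description, and the two records are a pair from Table 1 with extraction errors repaired.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical closeness test: difflib ratio at or above a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


# Two records that a human reader would take to describe the same person.
rec_a = {"ssn": "850982319", "name": "Ivette A Keegan", "address": "23 Florida Av."}
rec_b = {"ssn": "950982319", "name": "Yvette A Kegan", "address": "23 Florida St."}

# Structural matching: every field must be identical.
exact_match = all(rec_a[f] == rec_b[f] for f in rec_a)

# Tolerant matching: every field must merely be close.
approx_match = all(similar(rec_a[f], rec_b[f]) for f in rec_a)

print(exact_match, approx_match)
```

Field-by-field equality rejects the pair, while the tolerant comparison accepts it. A practical rule base would combine several such tests with domain knowledge rather than rely on one threshold.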
SSN         Name                   Address

334600443   Lisa Boardman          144 Wars St.
334600443   Lisa Brown             144 Ward St.

525520001   Raman Bonilla          38 Ward St.
525250001   Raymond Bonilla        38 Ward St.

0           Diana D. Ambrosion     40 Brik Church Av.
0           Diana A. Dambrosion    40 Brick Church Av.

0           Colette Johnea         600 113th St. apt 5a5
0           John Colette           600 113th St. ap. 585

850982319   Ivette A Keegan        23 Florida Av.
950982319   Yvette A Kegan         23 Florida St.
EXAMPLE 1

EXAMPLE OF MATCHING RECORDS 
DETECTED BY AN EQUATIONAL THEORY 
RULE BASE 

THE SORTED NEIGHBORHOOD METHOD 
Two approaches are available to obtain efficient execution 
of any solution: utilize parallel processing, and partition the 
data to reduce the combinatorics of matching large data sets.
Hence, a means of effectively partitioning the data set in 
such a way as to restrict attention to as small a set of 
candidates for matching as possible is required. 
Consequently, the candidate sets may be processed in par- 
allel. Furthermore, if the candidate sets can be restricted to 
a small subset of the data, quadratic time algorithms applied 
to each candidate set may indeed be feasible, leading to 
perhaps better functional performance of the merge task. 

One possible method for bringing matching records close
together is sorting the records. After the sort, the comparison
of records is then restricted to a small neighborhood within
the sorted list. This technique is referred to herein as the sorted
neighborhood method. The effectiveness of this approach is
based on the quality of the chosen keys used in the sort.
Poorly chosen keys will result in a poor quality merge, i.e.,
data that should be merged will be spread out far apart after
the sort and hence will not be discovered. Keys should be
chosen so that the attributes with the most discriminatory
power are the principal fields inspected during the sort.
This means that similar and matching records should have
nearly equal key values. However, since it is assumed that
the data contains corruptions, and keys are extracted directly
from the data, then the keys could also be corrupted. Thus,
it is expected that a substantial number of matching records
will not be caught. In fact, experimental results demonstrate
this to be the case.

Given a group of two or more database tables, they can
first be concatenated into one sequential list of records and
then processed according to the sorted neighborhood
method. The sorted neighborhood method for solving the
merge/purge problem can be summarized in three phases:

1. Create Keys: Compute a key for each record in the list by
extracting relevant fields or portions of fields.

2. Sort Data: Sort the records in the data list using the key
of step 1.

3. Merge: Move a fixed size window through the sequential
list of records limiting the comparisons for matching
records to those records in the window. If the size of the
window is w records, then every new record entering
the window is compared with the previous w-1 records to
find "matching" records. The first record in the window
slides out of the window.
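The three phases above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present description: the key functions, the difflib-based closeness test, its 0.8 threshold and the matching rule are all hypothetical, and the sample records are drawn from Table 1.

```python
from difflib import SequenceMatcher


def close(a, b, threshold=0.8):
    # Hypothetical closeness test on two strings.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def matches(r, s):
    # Hypothetical rule: close addresses and either equal SSNs or close names.
    return close(r["address"], s["address"]) and (
        r["ssn"] == s["ssn"] or close(r["name"], s["name"]))


def sorted_neighborhood(records, make_key, matches, w):
    # Phases 1 and 2: compute a key per record and sort record indices by it.
    order = sorted(range(len(records)), key=lambda i: make_key(records[i]))
    pairs = set()
    # Phase 3: compare each record with the previous w-1 records in sort order.
    for pos, i in enumerate(order):
        for j in order[max(0, pos - (w - 1)):pos]:
            if matches(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


records = [
    {"ssn": "334600443", "name": "Lisa Boardman", "address": "144 Wars St."},
    {"ssn": "525250001", "name": "Raymond Bonilla", "address": "38 Ward St."},
    {"ssn": "334600443", "name": "Lisa Brown", "address": "144 Ward St."},
    {"ssn": "525520001", "name": "Raman Bonilla", "address": "38 Ward St."},
]


def ssn_key(rec):
    return rec["ssn"]


def name_key(rec):
    # First three letters of the last name.
    return rec["name"].split()[-1][:3].upper()


by_ssn = sorted_neighborhood(records, ssn_key, matches, w=2)
by_name = sorted_neighborhood(records, name_key, matches, w=2)
by_name_wide = sorted_neighborhood(records, name_key, matches, w=4)
print(by_ssn, by_name, by_name_wide)
```

With a window of two records, the SSN key brings both matching pairs together, while the last-name key separates "Boardman" from "Brown" and misses that pair until the window is widened. This mirrors the dependence on key quality noted above.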
When this procedure is executed serially, the create keys
phase is an O(N) operation, the sorting phase is O(N log N),
and the merging phase is O(wN), where N is the number of
records in the database and w is the window size. Thus, the
total time complexity of this method is O(N log N) if
w < ⌈log N⌉, O(wN) otherwise.
However, the constants in the equations differ greatly. It
could be relatively expensive (i.e. require substantial
computational resources to solve a problem having a high
computational complexity) to extract relevant key values
from a record during the create key phase. Sorting requires
a few machine instructions to compare the keys. The merge
phase requires the matching of a large number of rules to
compare two records, and thus has the largest constant
factor. Note, however, the dominant cost will be the number
of passes over the data set during sorting (possibly as many
as log N passes), an I/O bounded computation.

CLUSTERING THE DATA FIRST 

Since sorting the data is the dominant cost of the sorted- 
neighborhood method, it is desirable to reduce the number 
of records that are sorted. An easy solution is to first partition 
the data into clusters using a key extracted from the data. 
The sorted-neighborhood method is then applied to each 
individual cluster. This approach is called the clustering 
method. 
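A minimal sketch of the clustering method follows. It assumes a dictionary of buckets as a simplified stand-in for the cluster space, a hypothetical cluster key taken from the first three letters of the last name, and a deliberately simple matching rule (identical addresses); none of these choices is prescribed by the present description.

```python
from collections import defaultdict


def clustering_method(records, cluster_key, sort_key, matches, w):
    # Phase 1: scan the records once and assign each to a cluster.
    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[cluster_key(rec)].append(idx)

    # Phase 2: sort each cluster and slide a window of w records over it.
    pairs = set()
    for members in clusters.values():
        members.sort(key=lambda i: sort_key(records[i]))
        for pos, i in enumerate(members):
            for j in members[max(0, pos - (w - 1)):pos]:
                if matches(records[i], records[j]):
                    pairs.add((min(i, j), max(i, j)))
    return pairs


records = [
    {"name": "Diana D. Ambrosion", "address": "40 Brick Church Av."},
    {"name": "Raymond Bonilla", "address": "38 Ward St."},
    {"name": "Diana A. Dambrosion", "address": "40 Brick Church Av."},
    {"name": "Raman Bonilla", "address": "38 Ward St."},
]


def last_name_prefix(rec):
    # Hypothetical cluster key: first three letters of the last name.
    return rec["name"].split()[-1][:3].upper()


def same_address(r, s):
    # Hypothetical matching rule, kept trivial for the illustration.
    return r["address"] == s["address"]


def full_name(rec):
    return rec["name"]


clustered = clustering_method(records, last_name_prefix, full_name, same_address, w=2)
unclustered = clustering_method(records, lambda rec: "all", full_name, same_address, w=2)
print(clustered, unclustered)
```

Each cluster is sorted on its own, so less data is sorted at a time, but records whose cluster keys are corrupted ("Ambrosion" and "Dambrosion" here) land in different clusters and are never compared.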
Given a group of two or more databases, these can first be
concatenated into one sequential list of records. The
clustering method can be summarized as a two phase process:

Cluster Data: Scan the records in sequence and for each
record extract an n-attribute key and map it into an
n-dimensional cluster space. For instance, the first three
letters of the last name could be mapped into a 3D
cluster space.

Sorted-Neighborhood Method: The sorted-neighborhood
method is applied independently on each cluster. It is
not necessary, however, to recompute a key (step 1 of
the sorted neighborhood method). The key extracted
above for sorting may be employed.

When this procedure is executed serially, the cluster data
phase is an O(N) operation, and assuming the data is
partitioned into C equal sized clusters, the
sorted-neighborhood phase is O(N log (N/C)).

Clustering data as described above raises the issue of how
well partitioned the data is after clustering. If the data from
which the n-attribute key is extracted is distributed
uniformly over its domain, then it can be expected that all
clusters will have approximately the same number of records
in them. But real-world data is very unlikely to be uniformly
distributed and thus, it must be expected that it will be

records than the keys used for sorting. For example, suppose
two person names are spelled nearly (but not) identically,
and have the exact same address. It might be inferred they
are the same person. On the other hand, supposing two
records have exactly the same social security numbers, but
the names and addresses are completely different, it could
either be assumed that the records represent the same person
who changed his name and moved, or the records represent
different persons, and the social security number field is
incorrect for one of them. Without any further information,
the latter might be assumed more likely. The more
information there is in the records, the better inferences can
be made. For example, Michael Smith and Michele Smith
could have the same address, and their names are
"reasonably close". If gender information is available, it could be
inferred that Michael and Michele are married or siblings.

What is needed to specify for these inferences is an
equational theory that dictates the logic of domain
equivalence, not simply value or string equivalence. There
are of course numerous methods of specifying the axioms of
the theory, including assembler code (presumably for
speed). Users of a general purpose merge/purge facility will
benefit from higher level formalisms and languages
permitting ease of experimentation and modification. For

necessary to compute very large clusters and some empty 25 these reasons, it is preferred to employ a natural approach to 

clusters. specifying an equational theory and making it practical, 

Sometimes the distribution of some fields in the data is usin 8 a declarative rule language. Rule languages have been 

known, or can be computed as the data is inserted into the effectively used in a wide range of applications requiring 

database. For instance, a database may contain a field for inference over large data sets. Much research has been 

names. Lists of person names are available from which, e.g., 30 conducted to provide efficient means for their evaluation, 

the distribution of the first three letters of every name can be 30(1 technology can be exploited here for purposes of 

computed, thus providing a cluster space of bins (26 letters solving merge/purge. This technology is known to those 

plus the space). If such a list is unavailable, the name field skilled in the art. 

of the database tables may be randomly sampled to have an As an example, a simplified rule in English that exem- 

approximation of the distribution of the first three letters. In 33 plifies one axiom of the equational theory relevant to merge/ 

any case, it is easy to create a frequency distribution purge applied to the idealized employee database is shown 

histogram for several fields in the databases. All of this below: 
information can be gathered off-line before applying the 

Clustering method. Given two records, rl and rZ 

Assuming the data is divided into C clusters using a key 40 w ^ ° f rl rf * 

. . « j»~ , „ . . _ 45 / AND tho first names differ slightly, 

extracted from a particular field. Given a frequency distn- and the address of n equals^ address of rl 

bution histogram with B bins for that field (C^B), those B then 

bins (each bin represents a particular range of the field rl * equivalent to rl. 

domain) may be divided into C subranges. Let b, be the _ . , . c ltJ .„ .. . , „ .„ . . 

normalized frequency for the i' h bin of the histogram: 45 The implementation of "differ slightly" specified here in 

English is based upon the computation of a distance function 

B applied to the first name fields of two records, and the 

^ 1 bl ~ x comparison of its results to a threshold The selection of a 

distance function and a proper threshold is also a knowledge 

_ „ , . . ^ , , .50 intensive activity mat demands experimental evaluation. An 

Then for each of the C subranges the expected sum of the improperly chosen threshold will lead to either an increase 

frequencies over the subrange is close to 1/C (e.g., if bins s m me number of falsely matched records or to a decrease in 

to e, l^s^B are assigned to one cluster then it is the nU mber of matching records that should be merged. A 

expected: number of alternative distance functions were implemented 

e 55 and tested including distances based upon edit distance, 

i phonetic distance and "typewriter" distance. The results 

w presented below are based upon edit distance computation 

since the outcome of the program did not vary much among 

Each subrange will become one of the clusters and, given a the different distance functions. 

record, the key is extracted from the selected field, and map eo For the purpose of experimental study, an OPS5 rule 

the key into the corresponding subrange of the histogram. program consisting of 26 rules for this particular domain of 

The complexity of this mapping is, at worst, log B. employee records was used over relatively small databases 

pot t atttyvt a t TTTcnpv of records. See C. L. Forgy, "OPS 5 user's manual", Tech- 

nyuAllulNAL i iifcUKY ^ Re ^ n CMU . CS . U _ 135 ^ Carnegie Mellon University 

The comparison of records, during the merge phase, to 65 (July 1981). Once the performance of the rules is deemed 

determine their equivalence is a complex inferential process satisfactory, distance functions, and thresholds, the program 

that considers much more information in the compared was recoded with rules written directly in C to obtain 
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speed-up over the OPS5 implementation. Table 1 demon- 
strates a number of actual records this program correctly 
deems equivalent. Although compilers for rule languages 
exist, see D. P. Miranker, B. Lofaso, G. Farmer, A. Chandra, 
and D. Brant. **On a TREAT-based production system 5 
compiler". Proc. 10th Int'l Conf. on Expert Systems, pp 
617-630, (1990), there is still a significant gap in perfor- 
mance forcing the inevitable conversion to C. However, 
OPS5 provided an especially useful prototyping facility to 
define an equational theory conveniently. io 
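
By way of illustration only, the example rule of the equational theory (equal last names, slightly differing first names, equal addresses) might be coded directly as below. The field names `last`, `first` and `address`, the function names, and the threshold of 2 are assumptions made for this sketch; they are not the rules, distance function or threshold of the 26-rule program.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def differ_slightly(a: str, b: str, threshold: int = 2) -> bool:
    """Assumed reading of 'differ slightly': edit distance within the threshold.

    Identical strings (distance 0) are accepted too, which is a choice made
    for this sketch rather than something the rule text settles.
    """
    return edit_distance(a.lower(), b.lower()) <= threshold


def equivalent(r1: dict, r2: dict, threshold: int = 2) -> bool:
    """One axiom: same last name, similar first name, same address."""
    return (r1["last"] == r2["last"]
            and differ_slightly(r1["first"], r2["first"], threshold)
            and r1["address"] == r2["address"])
```

Raising the threshold admits more false matches, while lowering it misses true duplicates, which is the trade-off described above for an improperly chosen threshold.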

USING THE TRANSITIVE CLOSURE OVER 
THE RESULTS OF INDEPENDENT RUNS 

Once an equational theory is specified for matching
database records and converted to a program, the matching
program is applied to a small subset of data, e.g., those
records presently in the window of the sorted list. The
program output thus depends upon whether matching
records appear in a window. Consequently, the effectiveness
of the sorted neighborhood method highly depends on the
key selected to sort the records. A key is defined to be a
sequence of a subset of attributes, or substrings within the
attributes, chosen from the record. (For example, the last
name of the employee record may be chosen as a key,
followed by the first non-blank character of the first name
field, followed by the first six digits of the social security
field, and so forth.)

In general, no single key will be sufficient to catch all
matching records. Keys give implicit priorities to those
fields of the records occurring at the beginning of the
sequence of attributes over others. If the error in a record
occurs in the particular field or portion of the field that is the
most important part of the key, there is little chance this
record will end up close to a matching record after sorting.
For instance, if an employee has two records in the database,
one with social security number 193456782 and another
with social security number 913456782 (the first two
numbers were transposed), and if the social security number is
used as the principal field of the key, then it is very unlikely
both records will fall under the same window. Thus, the
records will not be merged. The number of matching records
missed by one run of the sorted neighborhood method can be
comparatively large.

To increase the number of similar records merged, two
options can be explored. The first is simply widening the
scanning window size by increasing w. Clearly this increases
the complexity, and, as discussed in the next section, does
not dramatically increase the number of similar records
merged (unless of course the window spans the entire
database, which as noted corresponds to an infeasible N^2
operation). The alternative strategy, which is the one
implemented, is to execute several independent runs of the
sorted neighborhood method, each time using a different key
and a relatively small window. For instance, in one run, the
social security number might be used as the principal part of
the key while in another run the last name of the employee
might be used as the principal part of the key. Each
independent run will produce a set of pairs of records which
can be merged. The transitive closure is then applied to those
pairs of records. The result will be a union of all pairs
discovered by all independent runs, with no duplicates, plus
all those pairs that can be inferred by transitivity.
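
The combination step can be sketched as follows. The sketch uses a union-find (disjoint-set) structure, which is one common way to obtain the groups implied by transitivity; it is offered as an assumption for illustration and is not asserted to be the closure algorithm used in the experiments. The name `transitive_closure` and the use of integer record identifiers are likewise illustrative.

```python
from typing import Dict, Iterable, List, Set, Tuple


def transitive_closure(runs: Iterable[Iterable[Tuple[int, int]]]) -> List[Set[int]]:
    """Merge the matched pairs of several independent runs into groups.

    Each run contributes pairs of record identifiers. Records linked through
    any chain of pairs, possibly spanning different runs, end up together.
    """
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pairs in runs:            # the union over all runs
        for a, b in pairs:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups: Dict[int, Set[int]] = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(groups.values(), key=min)
```

For example, if a run keyed on last name pairs records 1 and 2, and a run keyed on address pairs records 2 and 3, records 1 and 3 are placed in the same group even though no single run compared them.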

More particularly, as shown in FIG. 13, database 202 is
subjected to step 204 wherein a key is computed for each
record in database 202 by extracting at least a portion of a
first field. Next, the records in database 202 are subjected to
the technique of parallel merge sorting at step 206 (where
multiple processors are used), or merge sorting (where a
single processor is used). A predetermined number of
sequential records sorted according to the key are compared
to each other in step 208 to determine if one or more of the
records match. Identifiers are created for any matching
records and are stored in step 210.

Where the process shown in FIG. 13 is repeated for
multiple databases or clusters of records in one database,
stored identifiers 1 and 2 are created as shown in boxes 212
and 214 of FIG. 14. A union of these stored identifiers is
created in step 216, and subjected to transitive closure as
shown in step 218 of FIG. 14.

In the following, several independent runs of the sorted
neighborhood method are combined with the transitive
closure of the results, which drastically improves the results
of one run of the sorted neighborhood method. A drawback
of this combination is the need for several runs of the sorted
neighborhood method. However, each independent run
requires only a small search window. No individual run
produced comparable results, even with large windows. Thus,
the complexity of the merge phase for the sum total of these
independent runs is smaller than the complexity of one run
with a large window, while its functional performance was
far superior.

EXPERIMENTAL RESULTS 

GENERATING THE DATABASES 

All databases used to test the sorted neighborhood method 
and the clustering method were generated automatically by 
a database generator program. This database generator 
allows the selection among a large number of parameters 
including, the size of the database, the percentage of dupli- 
cate records in the database, and the amount of error to be 
introduced in the duplicated records. The principal benefit of 
the generator is to perform controlled studies and to estab- 
lish the functional performance of the solution method. Each 
record generated consists of the following fields, some of 
which can be empty: social security number, first name, 
initial, last name, address, apartment, city, state, and zip 
code. The names are chosen randomly from a list of 63000 
real names. The cities, states, and zip codes (all from the 
U.S.A.) come from publicly available lists.

The noise introduced in the duplicate records can go from 
small typographical changes, to complete change of last 
names and change of addresses. When setting the parameters 
for the kind of typographical errors, known frequencies from
studies in spelling correction algorithms were used. See K.
Kukich, "Techniques for automatically correcting words in
text", ACM Computing Surveys, 24(4):377-439 (1992). For
this study, the generator selected from 10% to 50% of the 
generated records for duplication with noise. 

PRE-PROCESSING THE GENERATED 
DATABASE 

Pre-processing the records in the database prior to the 
merge/purge operation might increase the chance of finding 
two duplicate records. For example, names like Joseph and 
Giuseppe match in only three characters, but are the same 
name in two different languages, English and Italian. A 
nicknames database or name equivalence database could be 
used to assign a common name to records containing iden- 
tified nicknames. 

Since misspellings are introduced by the database
generator, the results can probably be improved by running
a spelling correction program over some fields before
submitting the database to the sorted-neighborhood method.
Spelling correction algorithms have received a large amount
of attention for decades. See Kukich, supra. Most of the
spelling correction algorithms considered use a corpus of
correctly spelled words from which the correct spelling is
selected. A corpus for the names of the cities in the U.S.A.
(18670 different names) is available and can be used to
attempt correcting the spelling of the city field. The
algorithm described by Bickel in M. A. Bickel, "Automatic
correction to misspelled names: a fourth-generation
language approach", Communications of the ACM,
30(3):224-228 (1987) was selected for its simplicity and speed.
The use of the spell corrector over the city field improved the
percent of correctly found duplicated records by
1.5%-2.0%. A greater proportion of the effort in matching
resides in the equational theory rule base.



The purpose of this first experiment was to see how many
duplicate records the sorted neighborhood method could
find. Three independent runs of the sorted neighborhood
method were run over each database, and a different key was
used during the sorting phase of each independent run. On
the first run, the last name was the principal field of the key
(i.e., the last name was the first attribute in the key). On the
second run, the first name was the principal field of the key.
Finally, in the last run, the street address was the principal
field of the key. The selection of the keys was purely
arbitrary; the social security number could have been used
instead of, say, the street address. The data generator is
assumed to be controlled, such that all fields are noisy and
therefore it should not matter which fields are selected.

FIG. 7A shows the effect of varying the window size from
2 to 50 records in a database with 1,000,000 records and
with an additional 1423644 duplicate records with varying
noise. A record may be duplicated more than once. Each
independent run found between 50% and 70% of the
duplicated pairs. Increasing the window size does not help much,
and taking into consideration that the time complexity of the
procedure goes up as the window size increases, it is
obviously fruitless to use a large window size.

The line marked as X-closure over 3 keys in FIG. 7A
shows the results when the program computes the transitive
closure over the pairs found by the three independent runs.
The percent of duplicates found goes up to almost 90%. A
manual inspection of those records not found as equivalent
revealed that most of them are pairs that would be hard for
even a human to identify without further information (e.g.,
both records do not have a social security number, the names
are the same or very close, the street addresses are the same
but in different states).

However, the equational theory is not completely
accurate. It can mark two records as similar when they are not the
same real-world entity (false positives). FIG. 8 shows the
percent of those records incorrectly marked as duplicates as
a function of the window size. The percent of false positives
is almost insignificant for each independent run and grows
slowly as the window size increases. The percent of false
positives after the transitive closure is used is also very
small, but grows faster than each individual run alone. This
suggests that the transitive closure may not be effective if
the window size is very large.

The number of independent runs needed to obtain good
results with the computation of the transitive closure
depends on how corrupt the data is and the keys selected.
The more corrupted the data, the more runs might be needed
to capture the matching records. Although not shown in FIG.
7A, the sorted-neighborhood method, conducted with only
two independent runs and computing the transitive closure
over the results of those two runs, produced a percentage of
detected duplicate records of between 70% and 80%. The
transitive closure, however, is executed on pairs of record
id's, each at most 30 bits in the present example, and in
general log N bits, and fast solutions to compute transitive
closure exist. See R. Agrawal and H. V. Jagadish,
"Multiprocessor transitive closure algorithms", Proc. Int'l Symp.
on Databases in Parallel and Distributed Systems, pp 56-66
(December 1988). From observing real world scenarios, the
size of the data set over which the closure is computed is at
least one order of magnitude smaller than the matching
database of records, and thus does not contribute a large
cost. But note there is a heavy price due to the number of
sorts of the original large data set.

ANALYSIS 

The approach of using multiple sorts followed by the 
transitive closure is referred to as the multi-pass approach. 
The natural question posed is when is the multi-pass 
approach superior to the single sorted neighborhood case? 
The answer to this question lies in the complexity of the two 
approaches for a fixed accuracy rate. The accuracy rate, as 
defined herein is the total percentage of "mergeable" records 
found. 

The complexity of the multi-pass approach is given by the
time to create keys, the time to sort r times (in the
present example, r = 3), and the time for window scanning r
times (with window size w), plus the time to compute the
transitive closure:

$$T(\text{multi-pass}) = c_1 r N + c_2 r N \log N + c_3 r w N + T(TC)$$

where r is the number of passes, and T(TC) is the time for
the transitive closure. The constants depict the costs for
comparison only and are related as $c_1 < c_2 \ll c_3 = a c_2$, where
a > 1. From analyzing the experimental program, the window
scanning phase contributes a constant, $c_3$, which is at least
a = 3 times as large as the comparisons performed in sorting,
while the create keys constant, $c_1$, is roughly comparable to
the comparisons used in sorting. Thus, for the purposes of
the present analysis, it is assumed that $c_1 = c_2$, while $c_3 = a c_2$.
Hence, the constants are replaced in terms of the single
constant c. The complexity of the closure is directly related
to the accuracy rate of each pass and is certainly dependent
upon the duplication in the database. However, it is assumed
the time to compute the transitive closure on a database that
is orders of magnitude smaller than the input database to be
less than the time to scan the input database once (i.e., less
than linear in N, contributing a term $c_4 N < N$). Thus,

$$T(\text{multi-pass}) = c r N + c r N \log N + a c r w N + c_4 N = (c r + c r \log N + a c r w + c_4) N$$

for a window size of w. The complexity of the single pass
sorted neighborhood approach is similarly given by:

$$T(\text{single-pass}) = c N + c N \log N + a c W N = (c + c \log N + a c W) N$$

for a window size of W.
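
The two cost expressions above can be compared numerically. The following is a rough sketch under stated assumptions: the logarithm is taken base 2, the function names are hypothetical, and the parameter values in the usage note are the ones this description reports for its experiments (N = 2^20, r = 3, a of about 3, c of about 8 x 10^-5, w = 10, T(TC) of about 180 seconds).

```python
import math


def t_multi_pass(n: int, r: int, w: int, c: float, a: float, t_closure: float) -> float:
    """Cost model for r passes with window w, plus the transitive closure time."""
    return c * r * n + c * r * n * math.log2(n) + a * c * r * w * n + t_closure


def t_single_pass(n: int, big_w: int, c: float, a: float) -> float:
    """Cost model for one pass with window W."""
    return c * n + c * n * math.log2(n) + a * c * big_w * n


def break_even_window(n: int, r: int, w: int, c: float, a: float, t_closure: float) -> int:
    """Smallest integer W at which the single pass costs more than the multi-pass."""
    target = t_multi_pass(n, r, w, c, a, t_closure)
    big_w = 1
    while t_single_pass(n, big_w, c, a) <= target:
        big_w += 1
    return big_w
```

Calling `break_even_window(2**20, 3, 10, 8e-5, 3.0, 180.0)` returns 45 under these assumptions, so the model places the crossover at a single-pass window of roughly 45 records.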

For a fixed accuracy rate, the question is then for what
value of W for the single pass sorted neighborhood method
does the multi-pass approach perform better in time, i.e.

$$(c + c \log N + a c W) > (c r + c r \log N + a c r w) + \frac{1}{N} T(TC)$$

or

$$W > \frac{r-1}{a} + \frac{r-1}{a} \log N + r w + \frac{1}{a c N} T(TC)$$

In the experiments performed and reported in the
following sections, $N = 2^{20}$ records, a is approximately 3, c is
approximately $8 \times 10^{-5}$, w = 10, and $T(TC) = c_4 N \approx 180$
seconds. Thus, the multi-pass approach dominates the single
sort approach when W > 45.

FIG. 9A shows the time required to run each independent
run of the sorted-neighborhood method on one processor,
and the total time required for the multi-pass approach. As
shown in FIG. 7A, the multi-pass approach was found to
produce an accuracy rate of 86.1% using a window size of
10. The time performance of the single pass run is similar to
the time performance of the multi-pass approach with w = 10
when W ≈ 56, a little over what was estimated above. But,
the performance ratios of all single-pass runs in FIG. 7A at
W = 50 are from 17% to 28%, well below the 86.1%
performance of the multi-pass approach. To study how large the
window size W must be for one of the single-pass runs to
achieve the same level of performance as the multi-pass
approach, the rule based equational theory was replaced with
a stub that quickly tells us if two records within the window
are actually equal (thus the "ideal" performance is studied).
The results, depicted in FIG. 10, show that any single-pass
run would need a window size larger than W = 50,000 to
achieve the same performance level as the multi-pass
approach using w = 10. The "real" performance lines in FIG.
10 are those of FIG. 7A, which are included to provide a
sense of how effective the present rule-based equational
theory is when compared with the ideal case. Thus, the
multi-pass approach achieves dramatic improvement in time
and accuracy over a single-pass approach. Further, the
multi-pass approach may also be parallelized, clearly
making the multi-pass the dominant method.

EXAMPLE 2

THE CLUSTERING METHOD

The same experiment was repeated using the clustering
method to first partition the data into clusters, using the same
three keys used above for the sorted-neighborhood method,
with three independent runs, one for each key. Then the
transitive closure over the results of all independent runs
was computed. The results are depicted in FIG. 7B.
Comparing performance results in FIG. 9B, it is noted that the
performance level is almost the same for both methods. The
timing results for these experiments are shown in FIG. 9B.

EXAMPLE 3

PARALLEL IMPLEMENTATION

With the use of a centralized parallel or distributed
network computer, a linear speedup over a serial computer
is sought to be achieved.

THE SORTED-NEIGHBORHOOD METHOD

The parallel implementation of the sorted-neighborhood
method is as follows. Let N be the number of records in the
database. The implementation is presumed to have P
processors, each processor being able to store M+w records,
where w is the size of the merge phase window, and M is a
blocking factor. Furthermore, since very large databases are
the subject of this example, it is assumed that P << N and
MP < N. First, the input database is sorted in parallel using
the well known technique of parallel merge sorting. Then,
the sorted database is divided into N/MP blocks. Each of the
N/MP blocks is processed in turn as follows. Let $r_i$ represent
record i in a block, $0 \le i \le MP-1$. Each processor p receives
records $r_{(p-1)M}, \ldots, r_{pM-1}, \ldots, r_{pM+w-2}$, for $1 \le p \le P$ (i.e.,
each processor gets a partition of size M records plus the
w-1 records of the next partition of the block). Then
matching records can be searched independently at each
processor using a window of size w. This process is then
repeated with the next block of records. The time for the
merge phase process under this scheme is, in theory,
O(wN/P).

Each independent run of the sorted-neighborhood method
is independent of other independent runs. Therefore, given
n times more processors, independent runs may be executed
concurrently and at the end compute the transitive closure
over the results.

The sorted-neighborhood method was implemented on an
HP cluster consisting of eight HP9000 processors
interconnected by an FDDI network. FIG. 11A shows the total time
taken for each of the three independent runs from FIG. 7A as
the number of processors increases. The window size for all
these runs was 10 records. FIG. 11A also includes the time
it will take the sorted-neighborhood method to execute all
three independent runs over three times the number of
processors and then the computation of the transitive closure
of the results. Using the system described above, enough
processors to run all sorted-neighborhood runs concurrently
were unavailable, so that the time taken for all of the runs
must be estimated from results of each independent run. All
independent runs were run serially and the results were
stored on disk. The transitive closure was then computed
over the results stored on disk and the time measured for this
operation. The total time if all runs are executed
concurrently is, approximately, the maximum time taken by any
independent run plus the time to compute the closure. The
speed-ups obtained as the number of processors grows are
not exactly linear. The main reason for this is the inherent
sequentialities in the process, such as reading and broadcasting
the data to all processes.

THE CLUSTERING METHOD

The parallel implementation of the clustering method
works as follows. Let N be the number of records in the
database, P the number of processors and C the number of
clusters to be formed per processor. Given a frequency
distribution histogram, its range is divided into CP
subranges as described above. Each processor is assigned C of
those subranges. To cluster the data, a coordinator processor
reads the database and sends each record to the appropriate
processor. Each processor saves the received records in the
proper local cluster. Once the coordinator finishes reading
and clustering the data among the processors, all processors
sort and apply the window scanning method to their local
clusters.

Load balancing of the operation becomes an issue when
more than one processor is used and the histogram method
does a bad job of partitioning the data. The present system
attempts to do an initial static load balancing. The
coordinator processor keeps track of how many records it sent to
each processor (and cluster) and therefore it knows, at the
end of the clustering stage, how balanced the partition is. It
then redistributes the clusters among processors using a
simple longest processing time first (LPT) strategy. See R.
Graham, "Bounds on multiprocessing timing anomalies",
SIAM Journal of Computing, 17:416-429 (1969). That is,
move the largest job in an overloaded processor to the most
underloaded processor, and repeat until a "well" balanced
load is obtained. Elements of this technique are known. See
H. M. Dewan, M. A. Hernandez, J. Hwang, and S. Stolfo,
"Predictive dynamic load balancing of parallel and
distributed rule and query processing", Proceedings of the 1994
ACM Sigmod Conference (1994).

The time results for the clustering method are depicted in
FIG. 11B. These results are for the same database used to
obtain the timing results for the sorted neighborhood
method, a window size of 10 records, and 100 clusters per
processor. Comparing the results in FIG. 11B with FIG. 11A,
it is noted that the clustering method is, as expected,
faster than the sorted-neighborhood method.


EXAMPLES 
SCALING UP 

Finally, the sorted-neighborhood method and clustering 25 
method are demonstrated herein to scale well as the size of 
the database increases. The present example is limited, by 
virtue of limitations in disk space in the experimental 
system, to databases up to about 3,000.000 records. Of 
course, larger systems could be implemented without this ^ 
limitation by providing more disk space. Again, three inde- 
pendent runs were run using the sorted-neighborhood 
method (and the clustering method), each with a different 
key, and then computed the transitive closure of the results. 
This was performed for the 12 databases as shown in Table 35 
2 and ran all the experiments assigning 4 processors to each 
independent run. The results are shown in FIG. 12A 
(Clustering Method) and FIG. 12B (Sorted-Neighborhood 
Method). As expected, the time increases linearly as the size 
of the databases increase. 



example, a DEC Alpha workstation, a RISC processor-based 
computer) would produce a total time that is at least half the 
estimated time. 

The present system may be applied to image data obtained 
by scanning financial documents, and can also be used to 
create a codebook of background data. In a first such 
embodiment, the above merge/purge techniques can be used 
to extract or eliminate redundant records. In this manner, the 
database is first created by scanning a number of financial 
documents. Thereafter, a first sort is performed using a key 
that is specific to background data (or other data, for that 
matter). Comparisons are made on a predetermined number 
of sequential records sorted according to the first key to 
determine whether a match occurs. Identifiers are stored for 
matched records. A second key is determined by extracting 
at least a portion of a second field, the records are sorted 
using the second key, comparison are made on a predeter- 
mined number of sequential records sorted in accordance 
with the second key to determine whether a match occurs. 
Identifiers are stored again for the matched records. A union 
of the stored identifiers is made, and the union is subjected 
to transitive closure. It should be appreciated that one 
processor can be used and the first and second sorts per- 
formed sequentially, or multiple processors can be used in a 
parallel processing environment as discussed herein. 

It should also be appreciated that a codebook database can 
be created using the merge techniques in conjunction with 
the scanning techniques discussed. In this embodiment, a 
database of scanned documents is created and sorted in 
accordance with a first key that may include. e.g., specific 
background identifiers. A predetermined window of sequen- 
tial records are then compared to determine if matches exist. 
Where matches exist, the background can be subtracted (a 
purge stage), the background data then sent to a codebook 
database and an identifier specific to the entry in the code- 
book database can be stored with the original record sans 
purged background. In this manner, a codebook database is 
created. 

                            TABLE 2 

Original number        Total records             Total size (Mbytes) 
of records           10        30        50       10      30      50 

   500000        584495    754354    924029     45.4    58.6    71.8 
  1000000       1169233   1508681   1847606     91.3   118.1   144.8 
  1500000       1753892   2262808   2770641    138.1   178.4   218.7 
  1750000       2046550   2639892   3232258    161.6   208.7   255.7 

Using the graphs in FIGS. 12A and 12B, the time it will take 
to process 1 billion records using both methods may be 
estimated, assuming the time will keep growing linearly as 
the size of the database increases. For the sorted-neighborhood 
method, consider the last point of the 
"30" graph. Here, a database with 2,639,892 records was 
processed in 2172 seconds. Thus, given a database with 
1,000,000,000 records, approximately 
1×10^9 × (2172/2,639,892) s = 8.2276×10^5 s ≈ 10 days will be 
needed. Doing the 
same analysis with the clustering method, a database of size 
2,639,892 records was processed in 1621 seconds. Thus, 
given a database with 1,000,000,000 records, it is expected 
that approximately 1×10^9 × (1621/2,639,892) s = 6.1404×10^5 s 
≈ 7 days will be required. Of course, doubling the speed 
of the workstations and the channels used (which is possible 
today since the HP processors are slow compared to, for 

Therefore, the present invention will improve the operation 
of systems processing various large databases 
including, e.g., financial documents that have been scanned 
and the images stored, due to its ability to efficiently process 
large numbers of database records and merge corresponding 
records in the database. 
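
The linear extrapolation to one billion records given above can be checked with a few lines of arithmetic. The sketch below simply scales the two measured timings quoted in the text (2172 s and 1621 s for 2,639,892 records). The function and variable names are illustrative, and the linear-growth assumption is the one stated in the text, not an independent result.

```python
SECONDS_PER_DAY = 86_400


def extrapolate_seconds(target_records, measured_records, measured_seconds):
    """Scale a measured run time linearly with the number of records."""
    return target_records * measured_seconds / measured_records


TARGET = 1_000_000_000   # one billion records
MEASURED = 2_639_892     # last point of the "30" graph

sorted_neighborhood_s = extrapolate_seconds(TARGET, MEASURED, 2172)
clustering_s = extrapolate_seconds(TARGET, MEASURED, 1621)

sorted_neighborhood_days = sorted_neighborhood_s / SECONDS_PER_DAY
clustering_days = clustering_s / SECONDS_PER_DAY
```

The sorted-neighborhood figure comes to roughly 8.23×10^5 s, or about 9.5 days, which the text rounds to 10 days. The clustering figure comes to roughly 6.14×10^5 s, or about 7.1 days.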
Equivalents 

The above description, figures and embodiments are 
provided not to limit the invention but to assist one skilled 
in the art in better understanding the invention contained 
herein. The inventor is not thereby limited to the preferred 
embodiments. For example, checks may be replaced herein 
by any arbitrary form, document, computer generated image, 
fingerprint image, voiceprint, and so forth. The disclosure 
provided in terms of check processing and banking practices 
may be viewed as an equivalent to any arbitrarily organized 
document or form processing operation or system in any 
business, technical or organizational structure. 

I claim: 

1. A method for processing at least two images and storing 
the images in a database, comprising the steps of: 

(a) scanning the image to create a first digital image 
thereof; 

(b) comparing said first digital image against a codebook 
of stored digital images; 

(c) matching said first digital image with one of said 
stored digital images; 

(d) producing an index code identifying said one of said 
stored digital images as having matched said first 
digital image; 

(e) subtracting said one of said stored digital images from 
said first digital image to produce a second digital 
image; 

(f) storing said second digital image together with its 
respective index code as a record in the database; 

(g) repeating steps (a) through (f) at least once for another 
image; 

(h) clustering said stored images with their respective 
index codes based upon the use of at least one key, 
wherein said at least one key comprises a first key and 
second key; 

(i) computing said first key for each record in the database 
by extracting at least a portion of a first field; 

(j) merge sorting the records in the database using said 
first key; 

(k) comparing to each other a predetermined number of 
sequential records sorted according to said first key to 
determine if one or more of the records match; 

(l) storing identifiers for any matching records; 

(m) computing said second key for each record in the 
database by extracting at least a portion of a second 
field; 

(n) merge sorting the records in the database using said 
second key; 

(o) comparing to each other a predetermined number of 
sequential records sorted according to said second key 
to determine if one or more of the records match; 

(p) storing identifiers for any matching records; 
(q) creating a union of said stored identifiers; and 
(r) subjecting said union to transitive closure. 

2. The method of claim 1, wherein the comparing steps 
are each performed on a separate processor. 

3. The method of claim 2, wherein the database is sorted 
in parallel using parallel merge sorting. 

4. The method of claim 3, wherein in each group of 
processors for each database: 

(i) N is a number of records; 

(ii) P is a number of processors, each processor p, 1 ≤ p ≤ P, 
being able to store M+w records; 

(a) w is a size of a merge phase window; and 

(b) M is a blocking factor; 

(ii) P is less than N; 

(iii) r_i represents record i in a cluster, 0 ≤ i ≤ MP-1; and 
(a) each said comparing step comprises the steps of: 

(i) dividing the sorted database into N/MP clusters; 

(ii) processing each of the N/MP clusters in turn by 
providing each processor p with records 
r_{(p-1)M}, ..., r_{pM+w-2}; 

(iii) searching matching records independently at each 
processor using a window of the size w; and 

(iv) repeating the processing step for a next cluster of 
records. 

5. The method of claim 4, wherein: 

(a) in each group of processors for each database: 

(i) N is a number of records in the database; 

(ii) P is a number of processors p, and 1 ≤ p ≤ P; and 

(iii) C is a number of clusters to be formed per 
processor p; and 

(b) each said comparing step comprises the steps of: 

(i) dividing the N records into CP subranges; 

(ii) assigning to each processor C clusters of the 
subranges; 

(iii) providing a coordinator processor which reads the 
database and sends each record to an appropriate 
processor where each record is received; 

(iv) saving each received record at each processor 
where it is received in a proper local cluster; 

(v) after the coordinator finishes reading and clustering 
the data among the processors, sorting and applying 
a window scanning method to the local clusters of 
each processor. 

6. The method according to claim 5, wherein the coordi- 
nator processor load balances the various processors using a 
simple longest processing time first strategy. 

7. A method for processing at least two images, storing the 
images in a database, and creating a codebook representing 
portions of at least one of said images, comprising the steps 
of: 

(a) scanning each image to create a first digital image 
thereof; 

(b) storing each scanned image as a record in a database; 

(c) computing a first key for each record in the database 
by extracting at least a portion of a scanned image to 
create a field image; 

(d) merge sorting the records in the database using the first 
key; 

(e) comparing to each other a predetermined number of 
sequential records sorted according to the first key to 
determine if another record has a field image that 
matches the field image; 

(f) creating an identifier specific for each such matching 
field image that identifies the particular record in which 
the match was found; 

(g) subtracting the matched field image from the identified 
record to produce a residual image; 

(h) storing the residual image with its specific identifier; 

(i) storing the subtracted, matched field image in a code- 
book database; 

(j) computing a second key for each record in the database 
by extracting at least a portion of a scanned image to 
create a second field image; 

(k) merge sorting the records in the database using the 
second key; 

(l) comparing to each other a predetermined number of 
sequential records sorted according to the second key to 
determine if a record has a field image that matches the 
second field image; 

(m) storing an identifier specific for each such matching 
second field image that identifies the particular record 
in which the match was found; 

(n) subtracting the matched second field image from the 
identified record; and 

(o) storing the matched second field image in a codebook 
database. 

8. The method of claim 7, further comprising the steps of 
creating a union of the records that matched the first and 
second field images, and subjecting the union to transitive 
closure. 

9. A method for processing at least two images, storing the 
images in a database, and creating a codebook representing 
portions of at least one of said images, comprising the steps 
of: 

(a) scanning each image to create a first digital image 
thereof; 

(b) storing each scanned image as a record in a database; 

(c) computing a first key for each record in the database 
by extracting at least a portion of a scanned image to 
create a field image; 

(d) merge sorting the records in the database using the first 
key; 

(e) comparing to each other a predetermined number of 
sequential records sorted according to the first key to 
determine if another record has a field image that 
matches the field image; 

(f) creating an identifier specific for each such matching 
field image that identifies the particular record in which 
the match was found; 

(g) subtracting the matched field image from the identified 
record to produce a residual image; 

(h) storing the residual image with its specific identifier; 

(i) storing the subtracted, matched field image in a code- 
book database; 

(j) computing a second key for each residual image in the 
database by extracting at least a portion thereof to 
create a second field image; 

(k) merge sorting the residual images in the database 
using the second key; 

(l) comparing to each other a predetermined number of 
sequential residual images sorted according to the 
second key to determine if a residual image has a field 
image that matches the second field image; 

(m) creating an identifier specific for each such matching 
second field image that identifies the particular residual 
image in which the match was found; 

(n) subtracting the matched second field image from the 
respective residual image; 

(o) storing the residual image with its specific identifier; 
and 

(p) storing the subtracted, matched second field image in 
a codebook database. 

10. The method of claim 9, further comprising the steps 
of creating a union of the records that matched the first and 
second field images, and subjecting the union to transitive 
closure. 

11. The method of claim 7, wherein the field image 
[illegible] portion of the background. 

12. The method of claim 7, wherein the images [illegible] 
checks. 

13. The method of claim 7, wherein said residual image 
comprises handwritten information, further comprising the 
steps of determining a plurality of spline control points of at 
least a portion of said residual image. 

14. The method according to claim 7, wherein said 
residual image comprises handwritten information, further 
comprising the steps of determining a plurality of Fourier 
coefficients of at least a portion of said residual image. 

15. The method according to claim 7, wherein said 
residual image comprises handwritten information, further 
comprising the steps of determining a plurality of wavelets 
of at least a portion of said residual image. 

16. The method according to claim 7, wherein said 
residual image comprises handwritten information, further 
comprising the steps of determining a plurality of fractal 
transforms of at least a portion of said residual image. 

17. The method according to claim 7, wherein said 
residual image comprises handwritten information, further 
comprising the steps of determining a plurality of spatial 
patterns of at least a portion of said residual image. 

18. The method according to claim 1, wherein each said 
image is selected from the group consisting of [illegible] 
cheque, letter of credit, monetary instrument, food 
[illegible] inventory form, real estate document, official 
government form, brochures, instructional form, questionnaire 
form, laboratory data form, tax form, computer screen, stock 
certificate and bond certificate. 

* * * * * 



